Observation Uncertainty in Gaussian Sensor Networks


Anand  D.  Sarwate 


Electrical  Engineering  and  Computer  Sciences 
University  of  California  at  Berkeley 


Technical  Report  No.  UCB/EECS-2006-3 

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-3.html 

January  23,  2006 




Copyright  ©  2006,  by  the  author(s). 

All  rights  reserved. 

Permission  to  make  digital  or  hard  copies  of  all  or  part  of  this  work  for 
personal  or  classroom  use  is  granted  without  fee  provided  that  copies  are 
not  made  or  distributed  for  profit  or  commercial  advantage  and  that  copies 
bear  this  notice  and  the  full  citation  on  the  first  page.  To  copy  otherwise,  to 
republish,  to  post  on  servers  or  to  redistribute  to  lists,  requires  prior  specific 
permission. 


Acknowledgement 

I'd  like  to  thank  my  advisor,  Professor  Michael  Gastpar,  for  guiding  this 
work,  as  well  as  Professor  Anant  Sahai  for  useful  feedback  on  the  draft.  I 
had  helpful  discussions  about  this  work  with  Bobak  Nazer,  Dan  Hazen,  and 
Krishnan  Eswaran. 

This  work  was  supported  by  an  NDSEG  Fellowship  from  the  United  States 
Department  of  Defense,  and  the  National  Science  Foundation  under  award 
CCF-0347298. 


Finally,  thanks  to  Elizabeth  Foster-Shaner  for  her  infinite  patience. 


Observation  Uncertainty  in  Gaussian  Sensor  Networks 

by 

Anand  Dilip  Sarwate 

S.B.  Electrical  Engineering  (Massachusetts  Institute  of  Technology),  2002 
S.B.  Mathematics  (Massachusetts  Institute  of  Technology),  2002 

A  thesis  submitted  in  partial  satisfaction 
of  the  requirements  for  the  degree  of 

Master  of  Science 
in 

Engineering  -  Electrical  Engineering  and  Computer  Sciences 

in  the 

GRADUATE  DIVISION 
of  the 

UNIVERSITY  OF  CALIFORNIA,  BERKELEY 

Committee  in  charge: 

Professor  Michael  Gastpar,  Chair 
Professor  Anant  Sahai 


Fall  2005 


The  thesis  of  Anand  Dilip  Sarwate  is  approved. 


Chair 


Date 


Date 


University  of  California,  Berkeley 
Fall  2005 


Observation  Uncertainty  in  Gaussian  Sensor  Networks 


Copyright  ©  2005 


by 


Anand  Dilip  Sarwate 


Abstract 


Observation  Uncertainty  in  Gaussian  Sensor  Networks 

by 


Anand  Dilip  Sarwate 

Master  of  Science  in  Engineering  -  Electrical  Engineering  and  Computer  Sciences 

University  of  California,  Berkeley 
Professor  Michael  Gastpar,  Chair 


The  term  “sensor  network”  encompasses  a  wide  range  of  engineering  systems  with  dramatically 
different  characteristics.  We  consider  a  specific  class  of  sensor  networks  whose  objective  is  to 
reconstruct a source at a central terminal. Our objective in this thesis is to quantify the asymptotic
error in reconstructing the source as the number of data sources, the number of sensors, and the model
complexity increase. We consider three types of estimation systems: unconstrained estimators for vector
Gaussian  sources  that  are  allowed  direct  access  to  the  sensor  observations,  estimators  for  discrete 
sources  that  receive  information  via  rate  constrained  links  from  the  sensors,  and  estimators  for 
scalar  Gaussians  whose  input  is  the  output  of  a  multiple-access  channel. 

We  first  establish  bounds  on  the  optimal  estimator  performance  of  these  networks  using  a 
centralized  estimator  with  access  to  all  of  the  sensor  observations.  We  assume  the  observations  are 
noisy  linear  functions  of  the  source  and  are  thus  specified  by  a  matrix.  Because  the  asymptotic 
error  depends  only  on  the  spectral  properties  of  this  matrix,  we  can  use  tools  from  matrix  analysis 
to  give  bounds  on  the  spectrum  and  error  in  terms  of  the  entries  of  the  matrix  for  a  number  of 
different scenarios. Finally, we look at the case where the matrix is partially unknown. In some
cases we can estimate the matrix directly from the data and in others we must minimize the worst
mismatch distortion.


These  problems  can  also  be  looked  at  in  a  more  information-theoretic  framework.  We  look  at  a 
lossless  distributed  source  coding  problem  in  which  the  joint  distribution  of  the  sources  is  partially 
unknown.  Although  for  any  finite  number  of  sensors  standard  multi-terminal  source  codes  can 
easily  be  adapted  to  handle  the  model  uncertainty  across  time,  we  show  a  rate  penalty  is  incurred 
if the number of sensors and blocklength go to infinity simultaneously. This represents one kind of
tradeoff  between  delay  and  complexity  for  the  scaling  behavior  of  these  systems. 

Finally,  we  look  at  the  case  where  the  sensors  must  communicate  their  observations  across  an 
additive  white  Gaussian  noise  multiple-access  channel.  With  a  known  correlation  structure,  the 
optimal  error  converges  to  0  as  1/M,  where  M  is  the  number  of  sensors.  However,  a  simple  feedback 
scheme using $K$ bits broadcast to all sensors can provide a distortion that scales to 0 as $M^{-K/(K+2)}$.
We  conjecture  that  providing  similar  feedback  to  an  optimal  source  code  will  not  improve  the 
performance  beyond  that  of  our  protocol. 


Professor  Michael  Gastpar 
Thesis Committee Chair


Chapter 1


Prelude:  a  model  for  sensor  networks 


Consider  the  following  hypothetical  scenario:  many  sensors  are  placed  around  the  watershed 
of  a  city  in  order  to  monitor  contaminant  levels  in  the  ground  water.  These  contaminants  may 
have  been  introduced,  for  example,  by  illegal  dumping  of  waste.  The  goal  of  the  network  is  to 
measure  the  concentration  levels  and  report  this  information  back  to  a  monitoring  station.  Every 
sensor  can  only  measure  the  concentration  of  a  single  chemical  that  is  the  by-product  of  several 
types  of  contaminants,  so  the  observation  of  an  individual  sensor  may  not  be  very  informative.  The 
sensors  have  a  small  processor,  a  wireless  radio  to  communicate  with  the  monitoring  station,  and 
limited  battery  power.  The  engineering  problem  is  to  design  an  efficient  system  for  tracking  the 
contaminant  levels  over  time. 

This  is  a  problem  of  data-gathering  and  estimation  using  a  wireless  sensor  network.  We  are 
interested  in  the  theoretical  bounds  on  the  estimation  error  at  the  central  observer  and  what 
happens to these bounds as the numbers of sources and sensors increase. In order to accurately
address  these  questions  we  must  have  a  model  of  remote  sensing  that  is  both  rich  enough  to 
capture  the  problems  specific  to  this  application  and  simple  enough  to  be  amenable  to  theoretical 
analysis.  In  general,  the  complexity  of  real-world  sensing  scenarios  is  not  accurately  reflected  in 
the  models  studied  by  theoreticians.  Even  though  the  physics  of  the  observation  mechanism  may 
be well-understood, the resulting model may be intractable. Different modeling techniques used
on the observation and communication halves of the problem may cause difficulties in merging the
two.  Finally,  those  results  which  can  be  proved  may  yield  little  insight  into  engineering  tradeoffs 
or  may  be  so  tailored  to  a  specific  situation  as  to  be  ungeneralizable. 

In this thesis we will investigate a very specific class of data-gathering sensor networks. We will
introduce  structured  uncertainty  into  the  mapping  between  the  observed  variables  at  the  sensors 
(e.g.  concentration  levels  of  a  chemical  by-product)  and  an  underlying  data  source  of  interest  (e.g. 
concentration  levels  of  contaminants).  This  modeling  uncertainty  is  different  from  the  uncertainty 
caused  by  noise  in  the  observations;  it  is  uncertainty  about  how  the  observations  are  related  to  each 
other  rather  than  their  reliability. 

A  sensor  network  designed  for  estimation  will  have  different  performance  limits  depending  on 
the  constraints  on  communication  among  the  sensors  and  between  the  sensors  and  the  base  station. 
Correspondingly,  we  look  at  three  different  scenarios:  centralized  estimation,  lossless  multi-terminal 
source coding, and estimation over a shared additive multiple-access channel. We will first
discuss  a  toy  example  to  show  what  we  mean  by  observation  uncertainty  and  then  describe  our 
three  problems  and  main  results. 


1.1  A  toy  example 


Suppose $\{S_1[n]\}$ and $\{S_2[n]\}$ are a pair of iid discrete-time Gaussian random processes with
mean 0 and variance $\sigma_1^2$ and $\sigma_2^2$ respectively. These two processes represent different sources that
we would like to estimate using a sensor network. The network consists of $M$ sensors, each of
which observes a discrete-time process $\{U_j[n]\}$ for $j = 1, 2, \ldots, M$. These processes are given by the
equation

\[ U_j[n] = \begin{bmatrix} A_{1j} & A_{2j} \end{bmatrix} \begin{bmatrix} S_1[n] \\ S_2[n] \end{bmatrix} + W_j[n] , \tag{1.1} \]

where $\{\{W_j[n]\} : j = 1, 2, \ldots, M\}$ is a collection of independent iid Gaussian processes with mean
0 and variance $\sigma_W^2$. The pair $[A_{1j}\ A_{2j}]$ is equal to $[0\ 1]$ or $[1\ 0]$ equiprobably, but does not change
over time. Gathering the equations into a matrix we have

\[ U = AS + W . \tag{1.2} \]

This  models  the  case  where  each  sensor  observes  exactly  one  of  the  two  sources  through  noise. 

This example gives an idea of what we mean by observation uncertainty. Each sensor does not
know a priori which source it is observing. Another view is that the covariance of the matrix $A$ is
uncertain or that the joint distribution of $U$ is uncertain. If $\sigma_1^2 \ne \sigma_2^2$ then each sensor can compute
its own empirical variance and make an estimate of which source it is observing with exponentially
small probability of error. In this scenario the modeling uncertainty is resolvable at the sensors
given a sufficient amount of observed data.
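
The variance test described above can be sketched in a few lines. This is a minimal simulation, not the thesis's construction: the numerical values of the variances, the sample count, and the midpoint threshold in `guess_source` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random


def empirical_variance(samples):
    """Empirical second moment; the processes in the toy example are zero-mean."""
    return sum(x * x for x in samples) / len(samples)


def guess_source(samples, var1, var2, var_w):
    """Return 1 or 2: the source whose predicted variance var_i + var_w is closer.

    Requires var1 != var2. The midpoint threshold is an illustrative choice
    (an assumption), not a rule taken from the thesis.
    """
    threshold = 0.5 * (var1 + var2) + var_w
    below = empirical_variance(samples) < threshold
    if var1 < var2:
        return 1 if below else 2
    return 2 if below else 1


def simulate(num_sensors=20, num_samples=2000, sigma1=1.0, sigma2=2.0,
             sigma_w=0.5, seed=0):
    """Draw the two sources, let each sensor see one of them, and classify."""
    rng = random.Random(seed)
    sources = {
        1: [rng.gauss(0.0, sigma1) for _ in range(num_samples)],
        2: [rng.gauss(0.0, sigma2) for _ in range(num_samples)],
    }
    # Each sensor observes source 1 or source 2 equiprobably, fixed over time.
    truth = [rng.choice((1, 2)) for _ in range(num_sensors)]
    guesses = []
    for which in truth:
        u = [s + rng.gauss(0.0, sigma_w) for s in sources[which]]
        guesses.append(guess_source(u, sigma1 ** 2, sigma2 ** 2, sigma_w ** 2))
    return truth, guesses
```

Calling `simulate()` returns the true assignment and the per-sensor guesses. With the well-separated variances assumed here the two lists should agree, and shrinking the gap between `sigma1` and `sigma2` shows how many more samples the test then needs.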

However, if $\sigma_1^2 = \sigma_2^2$ then we must come up with something more clever. Depending on the
access the estimator has to the sensor’s information, information about the correlation between
the  sensors  may  become  costly  as  M  increases.  If  the  estimator  has  direct  access  to  the  sensor 
observations,  it  can  try  to  sort  the  sensors  by  measuring  their  correlation  and  then  estimate  the 
sources  separately.  If  the  sensors  must  compress  their  observations  before  transmitting  them,  they 
may  include  some  overhead  to  allow  the  estimator  to  do  this  sorting.  This  overhead  could  be 
avoided  if  the  sensors  can  communicate  between  themselves,  as  we  shall  see. 
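
The sorting step for an estimator with direct access to the observations can be sketched as follows. This is a hedged illustration only: it assumes equal source variances, uses sensor 0 as an arbitrary reference, and thresholds the empirical correlation at half its ideal value. The names `split_by_correlation` and `simulate` are hypothetical. The procedure recovers the two groups of sensors but not which group corresponds to which source.

```python
import random


def correlation(x, y):
    """Empirical correlation coefficient for zero-mean sequences."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_by_correlation(observations, threshold):
    """Group sensors by whether they correlate with sensor 0 (the reference)."""
    reference = observations[0]
    same, other = [], []
    for j, u in enumerate(observations):
        (same if correlation(reference, u) > threshold else other).append(j)
    return same, other


def simulate(num_sensors=10, num_samples=2000, sigma=1.0, sigma_w=0.5, seed=1):
    rng = random.Random(seed)
    sources = [[rng.gauss(0.0, sigma) for _ in range(num_samples)]
               for _ in range(2)]
    labels = [rng.randrange(2) for _ in range(num_sensors)]
    observations = [[s + rng.gauss(0.0, sigma_w) for s in sources[lab]]
                    for lab in labels]
    # Two sensors seeing the same source have correlation rho; otherwise 0.
    rho = sigma ** 2 / (sigma ** 2 + sigma_w ** 2)
    same, other = split_by_correlation(observations, threshold=0.5 * rho)
    return labels, same, other
```

The sketch relies on the estimator seeing all raw observations at once, which is exactly the access that the compressed and multiple-access settings take away.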

1.2  Problem  descriptions  and  main  results 

In  this  section  we  will  describe  two  different  frameworks  for  the  sensor  network  problem  as  well 
as  what  we  mean  by  uncertainty  in  observations.  For  sources  taking  values  in  a  discrete  set,  we 
assume  that  each  sensor  observes  a  different  source  directly  but  that  the  joint  distribution  of  all 
the  sources  is  unknown.  For  Gaussian  sources,  we  assume  that  the  sensor  observations  are  a  noisy 
linear  transformation  of  the  sources.  In  all  cases  we  assume  a  discrete-time  model  for  the  source 
process as well as any communication channels.

Figure 1.1. The distributed source coding problem. Each terminal views one component of a
correlated  source  and  encodes  it  into  a  rate-limited  message.  The  decoder  uses  all  the  messages  to 
reconstruct  the  sources. 

1.2.1  Discrete  sources  and  distribution  uncertainty 

In Chapter 3 we address the problem of multi-terminal source coding with distribution uncertainty.
The picture is shown in Figure 1.1. We assume that the source to be estimated is a tuple
$S = (S_1, S_2, \ldots, S_M)$, where each component $S_m \in \mathcal{S}_m$ and the $\mathcal{S}_m$ are finite sets. These sources have
some joint distribution $P(S)$ that is known to be in a set of distributions $\mathcal{A}$. Sensor $m$ observes
the sequence $S_m^n = (S_m[1], S_m[2], \ldots, S_m[n])$ of source samples and maps it into one of $2^{nR_m}$ possible
messages. The goal is to find the set of rate tuples $(R_1, \ldots, R_M)$ such that the decoder can recover
the original source sequences with a probability of error that goes to 0 as $n$ goes to $\infty$.

We will assume that the true distribution $P(S)$ cannot be estimated from the marginal distributions
of the sensors, and that $\mathcal{A}$ consists of these “indistinguishable” distributions. In this case,
the set of rates is limited by the worst-case distributions in the class $\mathcal{A}$ [7]. We give a construction
via binning in the style of Cover and Thomas [6] for this result, which gives an explicit characterization
of the (negligible) overhead needed to compensate for the class $\mathcal{A}$. The overhead is of the form
$\log|\mathcal{A}| \cdot n^{-1} \log n$, which corresponds to the rate needed to communicate the joint type of a distribution
in $\mathcal{A}$.

Since the focus of this thesis is on scaling behaviors, we are interested in the case where $M \to \infty$
as well as $n \to \infty$. Because $\mathcal{A}$ is a function of $M$, the complexity of our model may increase
exponentially in the number of sensors. For a fixed $M$, we can write the asymptotic excess rate
as $n \to \infty$ as $\log|\mathcal{A}_M| \cdot n^{-1} \log n$. However, if $M$ increases simultaneously with the blocklength, and
$\mathcal{A}_M$ grows exponentially in $M$, for a fixed $n$ this excess rate may not converge to 0. Taking the
blocklength as a proxy for the processing delay and $|\mathcal{A}_M|$ as a proxy for the model complexity, we can
show  that  if  M  grows  faster  than  the  region  of  achievable  rates  must  shrink  to  accommodate 
the  information  communicated  to  the  decoder  about  the  joint  distribution. 

1.2.2  Gaussian  sources  and  fading  observations 

In  contrast  to  the  discrete  case,  for  continuous  sources  we  will  assume  that  the  number  of 
sources  L  is  smaller  than  the  number  of  sensors  M.  We  model  the  underlying  source  as  an  iid 
Gaussian process $\{S[n]\}$ taking values in $\mathbb{R}^L$. At each time $S[n]$ is jointly Gaussian with mean
0 and variance $\sigma_S^2 I$. We view this as $L$ independent sources which we would like to estimate to
minimize  an  expected  squared-error  criterion. 

The observed signal $U_m[n]$ at sensor $m$ is given by the equation

\[ U_m[n] = A_m(\{S[k] : k \le n\}) + W_m[n] . \tag{1.3} \]

This follows from the Wold decomposition, which says that we can decompose the process $U_m$
conditioned on $S$ into a deterministic part (a function of $S$) and an additive stochastic process $W_m$,
which we view as noise. To make things even more simple, we will assume that $W_m$ is iid across
time and space according to some probability measure $\mu_W$.

By observation uncertainty, we mean the $\{A_m\}$ are themselves random variables that take
values in a set of functions. For example, suppose that there is only one source and that the sensor
observation $U_m$ is a noisy weighted average of the previous $N + 1$ time samples of the source:

\[ U_m[n] = \sum_{k=0}^{N} A_m[k] S[n-k] + W_m[n] = (A_m * S)[n] + W_m[n] . \tag{1.4} \]

Although this is just a linear filter, we may not know the filter coefficients exactly; they may depend
on the sensor’s physical location with respect to the source. We can either treat the filter $A_m[n]$
as an unknown parameter or (in Bayesian style) as a random variable with some prior distribution
$p_A(\cdot)$.
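
The filtered observation model (1.4) can be written out directly. The sketch below is illustrative: the function name `filtered_observation` is hypothetical, and it assumes the source is zero before time 0, an initial condition the text does not specify.

```python
import random


def filtered_observation(source, taps, sigma_w, rng):
    """Compute U[n] = sum_{k=0}^{N} taps[k] * S[n - k] + W[n].

    taps holds the N + 1 filter coefficients A_m[0..N]. Samples S[n] with
    n < 0 are taken to be zero (an assumption made for this sketch).
    """
    observations = []
    for n in range(len(source)):
        acc = 0.0
        for k, a in enumerate(taps):
            if n - k >= 0:
                acc += a * source[n - k]
        observations.append(acc + rng.gauss(0.0, sigma_w))
    return observations
```

For example, `filtered_observation([1.0, 2.0, 3.0], [0.5, 0.5], 0.0, random.Random(0))` averages each sample with its predecessor. Treating the filter as unknown amounts to the estimator not being told `taps`.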

Let us collect the functions $A_m$ into a vector $A$ that takes values in a set of functions $\mathcal{A}$. The
choice of $\mathcal{A}$ will reflect the degree of uncertainty in the observation function. If the realization of $A$
is known to the sensors, then we can condition on $A$ and find the average behavior by taking the
expectation with respect to $p_A(\cdot)$. A more interesting case is when $A$ is unknown to the sensors, so
that we must design strategies robust to the choice of $A$.

We  call  this  model  fading  observations  in  analogy  to  fading  communication  channels,  which  are 
used to model wireless links with multipath interference. We identify two major distinctions: fast
and slow fading. In fast fading, the realization of $A$ changes at every time step. In our filtering
example above, this would mean that the filter used to compute $U_m[n]$ is different from the filter
used to compute $U_m[n+1]$. Fast fading may occur as a result of source or sensor mobility or may
have  to  do  with  the  physics  of  the  quantity  measured  by  the  sensors.  In  slow  fading  A  is  chosen 
once  and  fixed  for  all  time,  albeit  unknown  to  the  sensors.  Again,  the  choice  of  slow  versus  fast 
fading  models  is  application  dependent. 
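
The difference between the two fading models can be made concrete with a memoryless special case of the filter, in which the observation function is a single random gain. The sketch below is an illustration under assumptions: the gain alphabet `(-1.0, 1.0)` and the function name `observe` are hypothetical and are not taken from the thesis.

```python
import random


def observe(source, sigma_w, fading, rng, gains=(-1.0, 1.0)):
    """Scalar-gain observations U[n] = A[n] * S[n] + W[n].

    fading="slow": the gain is drawn once and held fixed for all n.
    fading="fast": the gain is redrawn independently at every time step.
    Returns the gains actually used and the observations.
    """
    if fading not in ("slow", "fast"):
        raise ValueError("fading must be 'slow' or 'fast'")
    fixed_gain = rng.choice(gains)
    used, observations = [], []
    for s in source:
        a = fixed_gain if fading == "slow" else rng.choice(gains)
        used.append(a)
        observations.append(a * s + rng.gauss(0.0, sigma_w))
    return used, observations
```

In both modes the gains are hidden from the sensors. Under slow fading there is a single unknown to learn over time, whereas under fast fading each sample carries a fresh unknown.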

In both Chapters 2 and 4 we will assume that $A$ is a matrix in $\mathbb{R}^{M \times L}$. Conditioned on knowing
$A$, the sensor observations are jointly Gaussian, so the optimal centralized estimator is linear. In
Chapter 2 we review MMSE estimation for Gaussian random variables and express the error in
terms of the singular values of $A$. We can then use results from matrix analysis to analyze the
distortion for different constraints on the entries of $A$ and the estimator. In the case where $A$ is
unknown but slow-fading, we show that a centralized estimator can in some cases estimate $A$ and
then do the same MMSE estimation as before. However, for fast-fading $A$ the optimal strategy
is less clear. We examine the case when the estimator must be linear. In the case where the
fading distribution $p_A(\cdot)$ is known, we can find the best linear estimator. In the case where $p_A(\cdot)$ is
unknown, we formulate the problem as finding a linear estimator to minimize the worst mismatch.

In Chapter 4 we assume the sensors must communicate their observations over the additive
white Gaussian noise channel shown in Figure 1.2. Rather than the rate constraints of Chapter 3,
we assume the sensors’ communication is power-limited:

\[ E\left[ |X_m[n]|^2 \right] \le P . \tag{1.5} \]

Figure 1.2. The network considered in Chapter 4.

We  assume  there  is  a  single  source  (L  =  1)  and  a  slow-fading  matrix  A  that  is  a  vector  with  iid 
entries  according  to  some  bounded  zero-mean  distribution. 

In the absence of fading, two strategies have been proposed in the literature. The off-the-shelf
solution is to have the sensors use the optimal distributed lossy compression scheme and
then transmit their compressed messages losslessly across the communication channel. The lowest
distortion achievable for a rate $R$ is proportional to $1/R$ [28] and the highest rate achievable across
the channel is proportional to $\log M$, so the end-to-end distortion scales like $1/\log M$. However, if the
sensors simply transmit their observations raw [18] but scaled up to the power constraint of the
channel, the distortion scales like $1/M$. We call the former scheme separation-based transmission
and the latter uncoded transmission.
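
The $1/M$ scaling of uncoded transmission can be checked with a closed-form calculation. The sketch below assumes a specific fading-free setup that the text does not spell out here: each sensor observes $S + W_m$ with unit gain, transmits a scaled copy meeting the power constraint with equality, and the receiver sees the sum of the transmissions plus channel noise of variance `var_z`. The function name and all default values are hypothetical.

```python
def uncoded_distortion(num_sensors, var_s=1.0, var_w=1.0, var_z=1.0, power=1.0):
    """MMSE of S from Y = sum_m c * (S + W_m) + Z under uncoded transmission.

    c is chosen so that E[X_m^2] = c^2 * (var_s + var_w) = power.
    """
    m = num_sensors
    c2 = power / (var_s + var_w)
    # The M copies of the source add coherently at the receiver.
    signal = c2 * m ** 2 * var_s
    # Sensor noises add incoherently; the channel adds its own noise.
    noise = c2 * m * var_w + var_z
    # Scalar Gaussian MMSE: var_s - Cov(S, Y)^2 / Var(Y).
    return var_s * noise / (signal + noise)
```

Under these assumptions `num_sensors * uncoded_distortion(num_sensors)` approaches `var_w` as the number of sensors grows, which is the $1/M$ behavior quoted above.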

How does the slow-fading model affect the performance of these two strategies? The key problem
is that the sensors cannot estimate locally the sign of their observation. We show that this type
of sign-ambiguity, which is detrimental to centralized encoders, renders the uncoded transmission
scheme useless: on average, the sensors cancel themselves out and the received signal-to-noise ratio
goes to 0 as $M \to \infty$. Thus the distortion does not improve at all with more sensors.

Recall  that  in  the  case  of  centralized  estimation  with  slow  fading,  the  estimator  could  sometimes 
disambiguate  between  the  different  fading  possibilities.  How  much  information  do  the  sensors  need 
to  perform  a  similar  hypothesis  test  in  this  scenario?  We  propose  sharing  the  signal  of  one  sensor 
(a  “beacon”)  with  all  the  others.  An  open  conjecture  is  that  this  extra  side  information  does  not 
affect  the  scaling  rate  of  the  optimal  distributed  source  code.  We  show  that  even  if  the  beacon’s 
signal is quantized to 1 bit, $K$ samples of this side information are sufficient to give the uncoded
transmission protocol a distortion scaling rate of $O(M^{-K/(K+2)})$. If the beacon transmits every
time slot, the scaling rate will approach the optimal $O(M^{-1})$.

1.3  Where  we  are  going 

Our  interest  in  this  thesis  is  on  the  performance  achievable  as  the  number  of  sensors  tends  to 
infinity.  The  benefits  of  looking  at  the  asymptotic  performance  are  twofold.  Firstly,  because  the 
models  we  use  are  gross  simplifications  of  real  networks,  a  tight  characterization  of  the  performance 
is  impossible,  so  scaling  behaviors  may  give  more  insight  than  small  system  designs.  A  more 
aesthetic  benefit  is  that  many  of  the  expressions  have  “nice”  limiting  behavior.  However,  there  is  a 
danger  to  considering  only  the  asymptotic  picture:  intuitions  valid  for  finite  networks  break  down 
in  the  limit.  To  see  this,  consider  the  network  density.  To  increase  the  number  of  sensors,  we  can 
let  the  area  covered  by  the  network  expand  to  keep  density  constant  or  keep  the  area  fixed  and  let 
the  density  go  to  infinity.  In  the  latter  case,  the  distance  between  sensors  will  become  very  small, 
and  the  aggregate  signal-to-noise  ratio  for  a  single  source  across  the  sensors  may  tend  to  infinity. 

For  the  sensor  networks  studied  here  we  presume  the  existence  of  a  centralized  decoder  that 
wishes to aggregate the information sent by the sensors in order to reconstruct the source or a
function of the source. The constraints on the decoder’s access to the sensors’ observations are
the motivation for the three problems studied in the remaining chapters of this thesis. In the
next chapter we will look at centralized estimation, where the decoder has full access to the sensor
observations. In Chapter 3 we will look at lossless compression with rate-limited communication to
the decoder. Finally, in Chapter 4 we will look at a joint source-channel communication scenario.


Chapter 2


Scherzo:  centralized  estimation  from 
linear  models  in  AWGN 


We  turn  first  to  estimators  that  have  direct  access  to  the  sensor  observations.  Specifically,  we 
will  review  MMSE  estimation  of  Gaussian  sources  viewed  through  linear  functions  with  additive 
white  Gaussian  noise  (AWGN).  In  the  case  where  the  source  is  a  vector  and  the  functions  are 
memoryless,  the  observation  process  is  simply  multiplication  by  a  matrix.  The  bulk  of  this  chapter 
is  a  review  on  using  matrix  spectra  to  express  the  estimation  error  and  using  bounds  on  eigenvalues 
to  characterize  observation  matrices.  In  particular,  we  can  quantify  the  effects  of  sensor  density, 
dynamic  range,  and  other  engineering  parameters  on  the  limiting  behavior  of  these  systems.  We 
will  then  look  at  cases  when  the  matrix  characterizing  the  observations  is  not  known  a  priori.  In  the 
case  where  the  matrix  is  unknown  but  not  time-varying,  we  can  first  estimate  the  matrix  and  then 
build an optimal linear estimator. If the matrix is time-varying, we give two characterizations
of the performance of linear estimators depending on whether the distribution of the time variation is
known.


2.1 Linear estimation error via matrix spectra: a review


As  a  review,  we  will  first  describe  a  generic  matrix  model  for  the  observations  and  derive 
conditions  on  the  entries  of  the  matrix  for  the  error  to  converge  to  a  constant.  We  then  examine  a 
specific  example  of  this  kind  of  matrix  arising  from  upsampling  and  spatially  filtering  an  underlying 
source.  Finally,  we  describe  the  performance  of  central  estimators  under  multiplication  by  a  random 
matrix  and  leverage  results  on  the  spectral  convergence  to  calculate  the  asymptotic  mean-squared 
error.  Our  objective  is  to  unify  several  different  ways  of  generating  linear  models  and  analyze 
them  all  via  the  asymptotic  spectra  of  the  associated  matrices,  using  well-known  tools  from  matrix 
analysis. 

2.1.1  Problem  statement 

The structured observation models we would like to consider are motivated by different geometric
assumptions on the sources and sensors. However, the induced mathematical model is the
same in all cases, and is illustrated by the block diagram in Figure 2.1. For a problem with $M$
sensors and $L$ sources, we are interested in the estimation error as $M \to \infty$. There are two cases of
interest: $M/L$ constant and $M/L \to \infty$. In the former we will see that the sampling density $M/L$
will appear in the asymptotic error expressions. In the latter, we are mostly interested in whether
the error converges to 0 and if so, how fast.

The source generates an independent, identically distributed (iid) sequence of vectors $S[k] \in \mathbb{R}^L$:

\[ S = \{S[k] : k \ge 0\} \tag{2.1} \]

where at each time $k$ the vector $S[k]$ is a jointly Gaussian vector with mean 0 and covariance $\sigma_S^2 I$.
These source vectors are multiplied by a matrix $A \in \mathbb{R}^{M \times L}$, called the observation matrix, and
noise is added to form the sensor observation vector

\[ U[k] = A \cdot S[k] + W[k] \tag{2.2} \]

where $\{W[k] : k \ge 0\}$ are iid Gaussian random vectors with mean 0 and covariance $\sigma_W^2 I$. We call
equation (2.2) the observation process.

Figure 2.1. General diagram for remote observation via a known matrix $A$.

A memoryless centralized estimator for this problem is a function

\[ f : \mathbb{R}^M \to \mathbb{R}^L \tag{2.3} \]

that takes an observation vector $U$ and creates a source estimate $\hat{S} = f(U)$. The estimation error
is measured by taking the expectation of a distortion function $d(S, \hat{S})$:

\[ D = E[d(S, \hat{S})] . \tag{2.4} \]

In our case we measure distortion by mean-squared error:

\[ d(S, \hat{S}) = \frac{1}{L} \|S - \hat{S}\|^2 . \tag{2.5} \]

We  consider  memoryless  estimators  because  they  are  optimal  when  the  source  is  memoryless  as 
well.  However,  as  we  will  see  later,  this  may  not  be  the  case  when  the  matrix  A  changes  over  time. 

In our simple case it is well known that the optimal estimator is linear, so that

\[ \hat{S} = f(U) = F \cdot U = F(AS + W) . \tag{2.6} \]

The following well-known proposition gives the error for the optimal estimator in terms of the
singular values of $A$.

Proposition 1. Let $\{\alpha_j : j = 1, 2, \ldots, L\}$ be the singular values of $A$. Then the MMSE is given by

\[ D = \frac{1}{L} \sum_{j=1}^{L} \frac{\sigma_S^2 \sigma_W^2}{\alpha_j^2 \sigma_S^2 + \sigma_W^2} . \tag{2.7} \]


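Proof. The estimator can be written in terms of the covariance and cross-correlation matrices of
the various vectors. Let

$$ \Sigma_S = E[SS^T] = \sigma_S^2 I \tag{2.8} $$

$$ \Sigma_W = E[WW^T] = \sigma_W^2 I \tag{2.9} $$

$$ \Sigma_{US} = E[US^T] = A \Sigma_S \tag{2.10} $$

$$ \Sigma_{SU} = \Sigma_{US}^T = \Sigma_S A^T \tag{2.11} $$

$$ \Sigma_U = E[UU^T] = A \Sigma_S A^T + \Sigma_W . \tag{2.12} $$

Then

$$ F = \Sigma_{SU} \Sigma_U^{-1} . \tag{2.13} $$

We can calculate the expected squared error:

$$
\begin{aligned}
E[\|S - \hat{S}\|^2] &= E[\|S - \Sigma_{SU} \Sigma_U^{-1} U\|^2] \\
&= \operatorname{tr}\left[ E\left[ (S - \Sigma_{SU} \Sigma_U^{-1} U)(S - \Sigma_{SU} \Sigma_U^{-1} U)^T \right] \right] \\
&= \operatorname{tr}\left[ \Sigma_S - \Sigma_{SU} \Sigma_U^{-1} \Sigma_{US} \right] \\
&= \sigma_S^2 L - \operatorname{tr}\left[ \sigma_S^4 A^T (A A^T \sigma_S^2 + \sigma_W^2 I)^{-1} A \right] \\
&= \sigma_S^2 \left( L - \sigma_S^2 \operatorname{tr}\left[ A A^T (A A^T \sigma_S^2 + \sigma_W^2 I)^{-1} \right] \right) .
\end{aligned}
$$

Let $A = U \Lambda_\alpha V^T$ be the singular value decomposition of $A$, where $U$ and $V$ are orthogonal and
$\Lambda_\alpha$ is the matrix of singular values of $A$ with $(\Lambda_\alpha)_{ii} = \alpha_i$. Now:

$$
\begin{aligned}
E[\|S - \hat{S}\|^2] &= \sigma_S^2 \left( L - \sigma_S^2 \operatorname{tr}\left[ U \Lambda_\alpha^2 U^T (U \Lambda_\alpha^2 U^T \sigma_S^2 + \sigma_W^2 I)^{-1} \right] \right) \\
&= \sigma_S^2 \left( L - \sum_{i=1}^{L} \frac{\alpha_i^2 \sigma_S^2}{\alpha_i^2 \sigma_S^2 + \sigma_W^2} \right) \\
&= \sum_{i=1}^{L} \frac{\sigma_S^2 \sigma_W^2}{\alpha_i^2 \sigma_S^2 + \sigma_W^2} .
\end{aligned}
$$

As we can see from this equation, the optimal centralized estimation error is only a function of
the singular values of the observation matrix $A$. Finding the asymptotic behavior of these singular
values will yield the corresponding error bounds in this chapter.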
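As a small numerical illustration of (2.7), the sketch below evaluates the per-source distortion directly from a list of singular values. The function name `mmse_distortion` and the argument names `var_s` and `var_w` (standing for $\sigma_S^2$ and $\sigma_W^2$) are illustrative choices, not notation from this report.

```python
def mmse_distortion(singular_values, var_s, var_w):
    """Evaluate (2.7): the average over sources of
    var_s * var_w / (alpha_j^2 * var_s + var_w)."""
    terms = [var_s * var_w / (alpha ** 2 * var_s + var_w)
             for alpha in singular_values]
    return sum(terms) / len(terms)


# A zero singular value leaves that source at its prior variance var_s,
# while larger singular values push its contribution toward zero.
print(mmse_distortion([0.0, 1.0, 10.0], var_s=1.0, var_w=1.0))
```

Only the singular values enter the computation, which is why the rest of the chapter concentrates on how they scale.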


Suppose that we are interested in estimating a scalar source so that $L = 1$ and $M$ is
allowed to grow to $\infty$. In this case $A$ is a vector and $\alpha_1^2 = \|A\|^2$, so

$$ D = \frac{\sigma_S^2 \sigma_W^2}{\|A\|^2 \sigma_S^2 + \sigma_W^2} . \tag{2.14} $$

If the entries of $A$ are all bounded away from 0 then as $M$ increases $\|A\|^2 \to \infty$ and the distortion
scales to 0 as $M \to \infty$. We will return to this estimation problem in Chapter 4. We now close with
some specialized examples.
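The scaling claim for the scalar source can be checked numerically. The following is a minimal sketch that assumes every sensor has the same gain 0.5 (an arbitrary value chosen only so that the entries are bounded away from 0); `scalar_distortion` is a hypothetical helper name.

```python
def scalar_distortion(gains, var_s, var_w):
    """Distortion (2.14) for a scalar source observed through the gain vector A."""
    energy = sum(a * a for a in gains)  # ||A||^2
    return var_s * var_w / (energy * var_s + var_w)


# The distortion decays roughly like 1/M once ||A||^2 dominates the noise.
for m in (1, 10, 100, 1000):
    print(m, scalar_distortion([0.5] * m, var_s=1.0, var_w=1.0))
```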


Example  :  Circulant  network 


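Suppose that $A$ has the following structure:

$$
A = \begin{pmatrix}
a_0 & a_{M-1} & \cdots & a_{M-L+1} \\
a_1 & a_0 & \cdots & a_{M-L+2} \\
a_2 & a_1 & \cdots & a_{M-L+3} \\
\vdots & \vdots & & \vdots \\
a_{M-1} & a_{M-2} & \cdots & a_{M-L}
\end{pmatrix} \tag{2.15}
$$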


That is, each column of the observation matrix is a cyclic shift of the previous column. This may
happen if the sensors are placed in a circle around a second circle of sources [18]. In a far-field
approximation this may be a reasonable model. The squared singular values of $A$ are the eigenvalues of
$B = A^T A$, which has a special structure:

$$
B = \begin{pmatrix}
b_0 & b_1 & \cdots & b_{L-1} \\
b_{L-1} & b_0 & \cdots & b_{L-2} \\
\vdots & \vdots & \ddots & \vdots \\
b_1 & b_2 & \cdots & b_0
\end{pmatrix} \tag{2.16}
$$

Figure 2.2. A (contrived) example of a circulant network with $M = L$. The black circles on the
inner ring are sources observed by an outer ring of sensors represented by the gray circles.


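This is a circulant matrix. A useful property of circulant matrices is that they are diagonalized by
the discrete Fourier transform (DFT) matrix [20]. When $M = L$, the singular values of $A$ are the magnitudes of the DFT coefficients of
the sequence $\{a_m\}$. Thus we can write the distortion as:

$$ D = \frac{1}{L} \sum_{l=1}^{L} \frac{\sigma_S^2 \sigma_W^2}{\beta_l \sigma_S^2 + \sigma_W^2} , \tag{2.17} $$

where

$$ \beta_l = \sum_{k=0}^{L-1} b_k e^{-j 2\pi k l / L} \tag{2.18} $$

is the DFT of the first row of $B$.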
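The DFT relationship can be verified on a small case. The sketch below assumes $M = L$ and a real sequence $\{a_m\}$, builds the first row of $B = A^T A$ as a circular autocorrelation, and evaluates (2.17); the helper names `dft`, `gram_first_row` and `circulant_distortion` are illustrative only.

```python
import cmath


def dft(seq):
    """Plain O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(seq)
    return [sum(seq[k] * cmath.exp(-2j * cmath.pi * k * l / n) for k in range(n))
            for l in range(n)]


def gram_first_row(a):
    """First row of B = A^T A when A is the square circulant matrix whose
    first column is a (a circular autocorrelation of a)."""
    n = len(a)
    return [sum(a[m] * a[(m + k) % n] for m in range(n)) for k in range(n)]


def circulant_distortion(a, var_s, var_w):
    """Distortion (2.17), using beta_l = |DFT(a)_l|^2 as the eigenvalues of B."""
    betas = [abs(x) ** 2 for x in dft(a)]
    return sum(var_s * var_w / (b * var_s + var_w) for b in betas) / len(a)


a = [1.0, 0.5, 0.0, 0.25]
print([round(x.real, 6) for x in dft(gram_first_row(a))])  # DFT of first row of B
print([round(abs(x) ** 2, 6) for x in dft(a)])             # squared |DFT| of a
print(circulant_distortion(a, var_s=1.0, var_w=1.0))
```

The two printed lists should agree, which is the statement that the eigenvalues of $B$ are the squared magnitudes of the DFT of $\{a_m\}$ in this square case.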


For a physical example of a sensor network, consider the diagram shown in Figure 2.2. In
this example $M = L$, with a circle of sensors surrounding a circle of sources.¹ Suppose furthermore
that the sensor observations can be written as:

$$ U_m = \sum_{l=1}^{L} \xi(d(m, l)) S_l , \tag{2.19} $$

where $d(m, l)$ is the distance from sensor $m$ to source $l$ and $\xi(\cdot)$ is an attenuation (path-loss) that
is a function of the distance. If $\gamma$ and $\mu$ are the radii of the inner and outer circles respectively,
then by the law of cosines we can write this as

$$ U_m = \sum_{l=1}^{L} \xi\left( \sqrt{\gamma^2 + \mu^2 - 2\gamma\mu \cos\left( 2\pi \frac{m - l}{L} \right)} \right) S_l . \tag{2.20} $$

This has the desired form. By choosing a model for $\xi(\cdot)$, we can evaluate the spectra and hence the
asymptotic performance for centralized estimators with this topology.

¹ Of course, we could have the positions of the sources and sensors interchanged, so that the sensors form a circular
array, much like a panopticon [16].

Our  interest  in  circulant  matrices  is  not  because  they  provide  an  accurate  model  for  sensor 
network  observations,  but  because  they  characterize  the  limiting  behavior  of  Toeplitz  matrices, 
which  arise  in  the  LTI  filtering  framework  later  in  this  chapter. 


Example  :  block  models 


In  some  instances  we  may  be  able  to  break  our  estimation  problem  into  independent  parts  and 
write  the  overall  distortion  as  the  average  of  the  different  components.  Consider  an  observation 
matrix  A  which  can  be  written  in  a  block  form: 


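$$
A = \begin{pmatrix}
A_1 & 0 & \cdots & 0 \\
0 & A_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & A_K
\end{pmatrix} \tag{2.21}
$$

Let the block $A_k$ have $M_k$ rows and $L_k$ columns. Then the covariance matrix can also be written
in block form

$$
B = A^T A = \begin{pmatrix}
A_1^T A_1 & 0 & \cdots & 0 \\
0 & A_2^T A_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & A_K^T A_K
\end{pmatrix} \tag{2.22}
$$

The eigenvalues of this matrix are the union of the eigenvalues of the blocks $A_k^T A_k$ (the squared singular values of each $A_k$), $1 \le k \le K$, so

$$ D = \frac{1}{L} \sum_{k=1}^{K} \sum_{i=1}^{L_k} \frac{\sigma_S^2 \sigma_W^2}{\beta_i^{(k)} \sigma_S^2 + \sigma_W^2} , \tag{2.23} $$

where $\beta_i^{(k)}$ is the $i$-th eigenvalue of $A_k^T A_k$.

2.1.2 Bounded energy observations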


We now constrain the matrix $A$ to model a scenario in which the sensors can receive limited
power and each source emits limited energy. By this we mean that the sequence of coefficients
$\{a_{ml}\}$ is square summable over $l$ for each fixed $m$ and square summable over $m$ for each fixed $l$.
It is intuitive that in this case the sources cannot be recovered perfectly because the signal-to-noise
ratio for each source is bounded. As we noted in the last chapter, the assumptions here fit
with the expanding network view of scaling.

Let $L_n = L_0 n$ and $M_n = M_0 n$ be the number of sources and sensors respectively for a problem
at scale $n$. Let $\{a_{ml} : l > 0, m > 0\}$ be a 2-dimensional array of real numbers and define $A^{(n)} = \{a_{ml} : 1 \le l \le L_n, 1 \le m \le M_n\}$ to be the observation matrix at scale $n$.

Bounded  row  and  column  norms 

The constraints we impose are the following:

$$ \|A^{(n)}\|_1 = \max_{1 \le l < \infty} \sum_{m=1}^{\infty} |a_{ml}| < \epsilon_C \tag{2.24} $$

$$ \|A^{(n)}\|_\infty = \max_{1 \le m < \infty} \sum_{l=1}^{\infty} |a_{ml}| < \epsilon_R \tag{2.25} $$

These are the maximum column sum and maximum row sum norms, respectively. The first constraint
bounds the total contribution of a source to all the sensors' observations while the second
bounds the contribution of all the sources to a sensor's observation. Under these two constraints,
we can bound the asymptotic distortion in terms of the constants $\epsilon_C$ and $\epsilon_R$.

Proposition  2.  If  the  observation  matrix  A  satisfies  the  bounds  in  (2.24)  and  (2.25),  then  the 
asymptotic  distortion  for  a  centralized  estimator  is  bounded  away  from  0. 

Proof. We will use the row and column sum bounds to prove a bound on the maximum singular
value of the matrix $A^{(n)}$ that is independent of $n$. The squared singular values of $A^{(n)}$ are the eigenvalues
of the $L_n \times L_n$ matrix $B = (A^{(n)})^T A^{(n)}$. Consider the column sum of $B$:

$$ \sum_{j=1}^{L_n} |B_{ij}| = \sum_{j=1}^{L_n} \left| \sum_{k=1}^{M_n} a_{ki} a_{kj} \right| \le \sum_{k=1}^{M_n} |a_{ki}| \sum_{j=1}^{L_n} |a_{kj}| \le \sum_{k=1}^{M_n} |a_{ki}| \, \epsilon_R \le \epsilon_C \epsilon_R . $$

This bound holds for all $n$ and $i$, so we have $\|B\|_1 \le \epsilon_C \epsilon_R < \infty$. The largest eigenvalue of $B$ is upper
bounded by any matrix norm on $B$ [23, p. 297], so $\alpha_i^2 \le \epsilon_C \epsilon_R$ for all $i$.


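Turning to the distortion expression in (2.7) we can easily bound the distortion away from 0:

$$ D \ge \frac{1}{L_n} \sum_{j=1}^{L_n} \frac{\sigma_S^2 \sigma_W^2}{\epsilon_C \epsilon_R \sigma_S^2 + \sigma_W^2} = \frac{\sigma_S^2 \sigma_W^2}{\epsilon_C \epsilon_R \sigma_S^2 + \sigma_W^2} > 0 . \tag{2.26} $$

Since this bound is independent of $n$, the asymptotic distortion is also greater than 0.

□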
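The bound in Proposition 2 can be illustrated numerically. The sketch below assumes a symmetric banded observation matrix with taps (1, 0.5, 0.25), chosen only for illustration, computes the two induced norms, and compares their product with a power-iteration estimate of the largest eigenvalue of $B$. All function names are hypothetical.

```python
def column_sum_norm(a):
    """Maximum absolute column sum, the norm in (2.24)."""
    return max(sum(abs(row[l]) for row in a) for l in range(len(a[0])))


def row_sum_norm(a):
    """Maximum absolute row sum, the norm in (2.25)."""
    return max(sum(abs(x) for x in row) for row in a)


def distortion_lower_bound(a, var_s, var_w):
    """Right-hand side of (2.26) with eps_C, eps_R taken as the two norms of a."""
    eps = column_sum_norm(a) * row_sum_norm(a)
    return var_s * var_w / (eps * var_s + var_w)


def banded_matrix(n, taps):
    """n x n matrix with entry taps[|m - l|] inside the band and 0 outside."""
    return [[taps[abs(m - l)] if abs(m - l) < len(taps) else 0.0
             for l in range(n)] for m in range(n)]


def largest_eigenvalue_gram(a, iters=200):
    """Power-iteration estimate of the largest eigenvalue of B = A^T A."""
    n_cols = len(a[0])
    v = [1.0] * n_cols
    lam = 0.0
    for _ in range(iters):
        av = [sum(row[l] * v[l] for l in range(n_cols)) for row in a]
        w = [sum(a[m][l] * av[m] for m in range(len(a))) for l in range(n_cols)]
        lam = sum(x * x for x in w) ** 0.5
        v = [x / lam for x in w]
    return lam


for n in (5, 10, 30):
    a = banded_matrix(n, [1.0, 0.5, 0.25])
    print(n, largest_eigenvalue_gram(a), distortion_lower_bound(a, 1.0, 1.0))
```

In this example the norms stop growing once the band fits inside the matrix, so the lower bound on the distortion is the same at every scale.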


Simply having unbounded row or column norms is insufficient for the distortion to converge to
0. For example, we could have a matrix $A$ whose rank is only 1:

$$
A = \begin{pmatrix}
1 & 2^{-1} & 2^{-2} & 2^{-3} & \cdots \\
1 & 2^{-1} & 2^{-2} & 2^{-3} & \cdots \\
1 & 2^{-1} & 2^{-2} & 2^{-3} & \cdots \\
1 & 2^{-1} & 2^{-2} & 2^{-3} & \cdots \\
\vdots & \vdots & \vdots & \vdots &
\end{pmatrix} \tag{2.27}
$$

For this choice of $A$, every row is summable and every column is not, but all but one of the singular
values are 0, so the distortion does not converge to 0.


Toeplitz  matrices 

In  some  specific  cases,  we  can  obtain  a  closed  form  solution  for  the  limit  of  the  centralized 
distortion  as  the  number  of  sources  and  sensors  tend  to  infinity.  One  particular  case  is  that  of 
Toeplitz covariance matrices. Let $\{b_k : -\infty < k < \infty\}$ be a sequence of real numbers that satisfies

$$ \sum_{k=-\infty}^{\infty} |b_k| = \beta < \infty . \tag{2.28} $$

The matrix $B$ can be thought of as the covariance matrix of a wide-sense stationary random process.
So if the sensor observations are wide-sense stationary across space we would get a matrix with this
structure. The corresponding observation matrix $A$ can be thought of as a linear space-invariant
filter, as discussed in the sequel. It is important to realize that the oversampling ratio $M_0/L_0$ is
hidden in the matrix $B$, so that in the expressions given below we cannot simply change the value
of $M_0/L_0$ to lower the error.

The Fourier transform of $\{b_k\}$ is

$$ B(\omega) = \lim_{k \to \infty} \sum_{l=-k}^{k} b_l e^{-j l \omega} . \tag{2.29} $$

Let $K_n = M_n - 1$. Suppose that the matrix $B = A^{(n)} (A^{(n)})^T$ is a Toeplitz matrix whose first
row is $(b_0, \ldots, b_{K_n})$ and whose first column is $(b_0, \ldots, b_{-K_n})$. Since the $(i, j)$-th entry of $B$ is the
correlation between sensors $i$ and $j$, this gives us the wide-sense stationarity across space.


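The celebrated Grenander-Szegő Theorem [21, pp. 64-65] on the distribution of eigenvalues
of Toeplitz forms gives us a convenient expression for the limit. The theorem states that for any
function $F$ continuous on the support of the eigenvalues, the average of the function converges to
a limit:

$$ \lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} F(\alpha_j^{(n)}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} F(B(\omega)) \, d\omega , \tag{2.30} $$

where $\alpha_j^{(n)}$ denotes the $j$-th eigenvalue of the $n \times n$ Toeplitz matrix.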

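In our case we have

$$ \lim_{n \to \infty} \frac{1}{M_n} \sum_{j=1}^{M_n} \frac{\sigma_S^2 \sigma_W^2}{\alpha_j^{(n)} \sigma_S^2 + \sigma_W^2} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\sigma_S^2 \sigma_W^2}{B(\omega) \sigma_S^2 + \sigma_W^2} \, d\omega , \tag{2.31} $$

where $\alpha_j^{(n)}$ is the $j$-th eigenvalue of $B$. Note that $B$ is rank-deficient with $M_n - L_n$ eigenvalues equal
to 0, each of which contributes $\sigma_S^2$ to the sum. We can rewrite the left side of the equation:

$$ \lim_{n \to \infty} \left( \frac{M_n - L_n}{M_n} \sigma_S^2 + \frac{1}{M_n} \sum_{j=1}^{L_n} \frac{\sigma_S^2 \sigma_W^2}{\alpha_j^{(n)} \sigma_S^2 + \sigma_W^2} \right) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\sigma_S^2 \sigma_W^2}{B(\omega) \sigma_S^2 + \sigma_W^2} \, d\omega . \tag{2.32} $$

Since $L_n/M_n = L_0/M_0$ is a constant, we have

$$ D = \lim_{n \to \infty} \frac{1}{L_n} \sum_{j=1}^{L_n} \frac{\sigma_S^2 \sigma_W^2}{\alpha_j^{(n)} \sigma_S^2 + \sigma_W^2} = \frac{M_0}{L_0} \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\sigma_S^2 \sigma_W^2}{B(\omega) \sigma_S^2 + \sigma_W^2} \, d\omega - \left( \frac{M_0}{L_0} - 1 \right) \sigma_S^2 . \tag{2.33} $$

This expression quantifies the benefit of the sampling density $M_0/L_0$ on the asymptotic distortion.
Note, however, that the support of $B(\omega)$ depends on $M_0/L_0$; the more we oversample, the more
“pinched” the spectrum becomes.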


Example  :  harmonic  decay 


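Suppose

$$ b_n = \frac{\sin \omega_c n}{\pi n} , \tag{2.34} $$

which is just a sinc function. Its Fourier transform is a box:

$$ B(\omega) = \mathbf{1}(|\omega| < \omega_c) . \tag{2.35} $$

So the integral breaks into two terms:

$$
\begin{aligned}
D &= \frac{M_0}{L_0} \frac{1}{2\pi} \left( 2\omega_c \frac{\sigma_S^2 \sigma_W^2}{\sigma_S^2 + \sigma_W^2} + 2(\pi - \omega_c) \sigma_S^2 \right) - \left( \frac{M_0}{L_0} - 1 \right) \sigma_S^2 \\
&= \sigma_S^2 \left( 1 - \frac{M_0}{L_0} \frac{\omega_c}{\pi} \frac{\sigma_S^2}{\sigma_S^2 + \sigma_W^2} \right) .
\end{aligned}
$$

This shows the effect of the bandwidth of this spatial lowpass filter on the estimation error: the
higher the bandwidth the smaller the error.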
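The closed form for the box spectrum can be cross-checked against a direct numerical evaluation of (2.33). The sketch below uses a midpoint rule on $[-\pi, \pi]$; the function names and the parameter values ($\omega_c = \pi/2$, $M_0/L_0 = 1.5$, $\sigma_S^2 = 1$, $\sigma_W^2 = 0.5$) are arbitrary assumptions made for the illustration.

```python
import math


def szego_distortion(spectrum, var_s, var_w, oversampling, steps=20000):
    """Midpoint-rule evaluation of (2.33) for a spectrum B(omega) on [-pi, pi].
    oversampling stands for the ratio M_0 / L_0."""
    h = 2 * math.pi / steps
    total = 0.0
    for i in range(steps):
        omega = -math.pi + (i + 0.5) * h
        total += var_s * var_w / (spectrum(omega) * var_s + var_w)
    integral = total * h / (2 * math.pi)
    return oversampling * integral - (oversampling - 1.0) * var_s


def box_closed_form(omega_c, var_s, var_w, oversampling):
    """Closed-form distortion for the box spectrum (2.35)."""
    return var_s * (1.0 - oversampling * (omega_c / math.pi) * var_s / (var_s + var_w))


omega_c = math.pi / 2
numeric = szego_distortion(lambda w: 1.0 if abs(w) < omega_c else 0.0,
                           var_s=1.0, var_w=0.5, oversampling=1.5)
print(numeric, box_closed_form(omega_c, 1.0, 0.5, 1.5))
```

The two printed values should agree up to floating-point rounding for this choice of $\omega_c$, since the band edges fall on the boundaries of the integration grid.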


Example  :  exponential  decay 


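Suppose that $b_n = \beta^{|n|}$, a two-sided exponential decay. Its Fourier transform is

$$ B(\omega) = \frac{1}{|1 - \beta e^{j\omega}|^2} = \frac{1}{1 + \beta^2 - 2\beta \cos \omega} . $$

The distortion can be written as

$$ D = \sigma_S^2 \left( 1 - \frac{M_0}{L_0} \left( 1 - \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\sigma_W^2 (1 + \beta^2 - 2\beta \cos \omega)}{\sigma_S^2 + \sigma_W^2 (1 + \beta^2 - 2\beta \cos \omega)} \, d\omega \right) \right) . $$

We have to turn to a table of integrals to solve this. Let us write:

$$ \frac{\sigma_W^2 (1 + \beta^2 - 2\beta \cos \omega)}{\sigma_S^2 + \sigma_W^2 (1 + \beta^2 - 2\beta \cos \omega)} = \frac{\sigma_W^2 (1 + \beta^2) - 2\sigma_W^2 \beta \cos \omega}{\sigma_S^2 + \sigma_W^2 (1 + \beta^2) - 2\sigma_W^2 \beta \cos \omega} = \frac{C_1 + C_2 \cos \omega}{C_3 + C_2 \cos \omega} . \tag{2.36} $$

Then [19, TI(341)] gives two possibilities depending on the value of $C_3^2 - C_2^2$:

$$
\begin{aligned}
C_3^2 - C_2^2 &= \left( \sigma_W^2 (1 + \beta^2) + \sigma_S^2 \right)^2 - 4\beta^2 \sigma_W^4 \\
&= \sigma_W^4 \left( \left( \frac{\sigma_S^2}{\sigma_W^2} \right)^2 + 2(1 + \beta^2) \frac{\sigma_S^2}{\sigma_W^2} + (1 - \beta^2)^2 \right) \\
&> 0 .
\end{aligned}
$$

Thus:

$$
\begin{aligned}
\int_{-\pi}^{\pi} \frac{C_1 + C_2 \cos \omega}{C_3 + C_2 \cos \omega} \, d\omega &= \left[ \omega + \frac{2(C_1 - C_3)}{\sqrt{C_3^2 - C_2^2}} \arctan\left( \frac{\sqrt{C_3^2 - C_2^2} \, \tan \frac{\omega}{2}}{C_3 + C_2} \right) \right]_{-\pi}^{\pi} \\
&= 2\pi + \frac{C_1 - C_3}{\sqrt{C_3^2 - C_2^2}} \lim_{\omega \to \pi} 2 \cdot 2 \arctan\left( \frac{\sqrt{C_3^2 - C_2^2} \, \tan \frac{\omega}{2}}{C_3 + C_2} \right) \\
&= 2\pi - \frac{2\pi \sigma_S^2}{\sqrt{C_3^2 - C_2^2}} .
\end{aligned}
$$

So the distortion is

$$ D = \sigma_S^2 \left( 1 - \frac{M_0}{L_0} \frac{\sigma_S^2}{\sigma_W^2} \left( \left( \frac{\sigma_S^2}{\sigma_W^2} \right)^2 + 2(1 + \beta^2) \frac{\sigma_S^2}{\sigma_W^2} + (1 - \beta^2)^2 \right)^{-1/2} \right) . $$

This gives a more complicated relationship between the decay factor $\beta$, the oversampling ratio, and
the signal-to-noise ratio $\sigma_S^2/\sigma_W^2$.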
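As a sanity check on the table-of-integrals step, the sketch below compares the closed-form distortion for the exponential-decay example with a midpoint-rule evaluation of the integral form. It takes the spectrum $B(\omega) = 1/(1 + \beta^2 - 2\beta\cos\omega)$ exactly as written above; the function names and test parameters are illustrative assumptions.

```python
import math


def exp_decay_distortion_numeric(beta, var_s, var_w, oversampling, steps=4000):
    """Integral form of the distortion, evaluated with a midpoint rule.
    oversampling stands for the ratio M_0 / L_0."""
    h = 2 * math.pi / steps
    total = 0.0
    for i in range(steps):
        omega = -math.pi + (i + 0.5) * h
        q = 1 + beta ** 2 - 2 * beta * math.cos(omega)  # equals 1 / B(omega)
        total += var_w * q / (var_s + var_w * q)
    average = total * h / (2 * math.pi)
    return var_s * (1 - oversampling * (1 - average))


def exp_decay_distortion_closed(beta, var_s, var_w, oversampling):
    """Closed form in terms of the signal-to-noise ratio var_s / var_w."""
    snr = var_s / var_w
    root = math.sqrt(snr ** 2 + 2 * (1 + beta ** 2) * snr + (1 - beta ** 2) ** 2)
    return var_s * (1 - oversampling * snr / root)


for beta in (0.0, 0.5, 0.9):
    print(beta,
          exp_decay_distortion_numeric(beta, 1.0, 1.0, 1.0),
          exp_decay_distortion_closed(beta, 1.0, 1.0, 1.0))
```

Because the integrand is smooth and periodic, the midpoint rule converges quickly and the two columns should match to many digits.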

The  most  important  class  of  Toeplitz  models  comes  from  modeling  the  sensor  observations  as 
the  output  of  a  linear  time-invariant  (LTI)  system  driven  by  the  source  observations. 


2.1.3  Observation  models  from  LTI  filters 

Having established that the bounded energy conditions in the previous section limit the performance
of centralized estimators, we now turn to a situation in which the best centralized estimator
may be partially decentralized. The model we choose is one of estimating a spatially distributed
source through an LTI filter. A simple physical model for this can be made by assuming the sensors
and sources to be located on a line, as shown in Figure 2.3. A diagram of the sampling situation
is shown in Figure 2.4. The source is upsampled and filtered by a linear space-invariant filter $h[y]$
and each sensor observes a noisy sample of the upsampled and filtered source.

Figure 2.3. An example of a network on a line. Imagine the two lines are actually the same line.
The squares mark the locations of the sources, and the circles mark the locations of the sensors.
The sensor observations are the superposition of the impulse response of the filter centered at the
sensor locations. The upsampling ratio $N$ is shown via the dotted lines.

Figure 2.4. Filtering model for remote sensing. (Block diagram: the source $s[x]$ is upsampled by $N$ to give $t[y]$, filtered by $h[y]$, and noise $w[y]$ is added to produce $u[y]$.)

Let us write the filter as

$$ h[y] = \sum_{i=-\infty}^{\infty} h_i \, \delta[y - i] . \tag{2.37} $$

We can write the sequence $\{a_{ml}\}$ in terms of the filter coefficients. We assume that the filter is
absolutely summable:

$$ \sum_{y=-\infty}^{\infty} |h[y]| = h < \infty , \tag{2.38} $$

which implies that it is stable and its Fourier transform $H(e^{j\omega})$ exists [30].

The source sequence $s[x]$ is first upsampled by a factor $N$ that corresponds to the ratio $M/L$
in the previous section. This upsampled signal is then filtered by $h[y]$ and noise is added. We can
rewrite the filter as an infinite matrix $A$ (see [35, p. 72]):

$$
A = \begin{pmatrix}
 & \vdots & \vdots & \\
\cdots & h[0] & h[-N] & \cdots \\
 & \vdots & \vdots & \\
\cdots & h[N-1] & h[-1] & \cdots \\
\cdots & h[N] & h[0] & \cdots \\
 & \vdots & \vdots &
\end{pmatrix} \tag{2.39}
$$

Figure 2.5. Wiener filter solution for spatial filtering.


The matrix $AA^T$ is the autocorrelation matrix of the process $u[y]$. It is Toeplitz and generated
by the autocorrelation function $R_u[y] = E[u[y + x] u[x]]$. Therefore the asymptotic distortion is
given by equation (2.33) in the previous section. The MMSE estimator for these observations is a
linear operator $G(\cdot)$ which, when applied to $u[y]$, yields an estimate $\hat{s}[x]$ that is closest to $s[x]$ in
the mean-square sense:

$$ G = \operatorname*{argmin}_{G} E\left[ \| s[x] - G u[y] \|^2 \right] . \tag{2.40} $$

The solution is given in the following proposition.

Proposition 3. The MMSE estimator for the filtered observation model in Figure 2.4 is a cascade
of a non-causal Wiener filter for $t[y]$ followed by downsampling by $N$.


Proof. We will show that the proposed system shown in Figure 2.5 is in fact the MMSE estimator.
Let $\hat{t}[y]$ be the output of the Wiener filter for $t[y]$ given $u[y]$, and let $\hat{s}[x] = \hat{t}[xN]$ be its downsampled version (recall that $t[y] = s[y/N]$ for $y = xN$ and 0
otherwise). By the orthogonality property, the Wiener filter satisfies the following condition:

$$ E[(t[y] - \hat{t}[y]) \hat{t}[y]] = 0 . \tag{2.41} $$

Downsampling will not change this relationship: we can substitute $y = xN$ to get

$$ E[(s[x] - \hat{s}[x]) \hat{s}[x]] = 0 . \tag{2.42} $$

The system in Figure 2.5 is linear and the estimation error is uncorrelated with the estimate and, by the same orthogonality property, with the observations.
Since all of the signals are jointly Gaussian, the error is in fact independent of the observations,
so this estimator must be the MMSE estimator for $s[x]$. □

To calculate the filter and corresponding estimation error, define the following correlation functions
and spectra:

$$ r_{tu}[y] = E[t[y + z] u[z]] \tag{2.43} $$

$$ R_{tu}(e^{j\omega}) = \sum_{y=-\infty}^{\infty} r_{tu}[y] e^{-j\omega y} \tag{2.44} $$

$$ r_t[y] = E[t[y + z] t[z]] \tag{2.45} $$

$$ R_t(e^{j\omega}) = \sum_{y=-\infty}^{\infty} r_t[y] e^{-j\omega y} . \tag{2.46} $$

The frequency response of the non-causal Wiener filter is given by

$$ G(e^{j\omega}) = \frac{R_{tu}(e^{j\omega})}{R_t(e^{j\omega})} . \tag{2.47} $$

The infinite matrix corresponding to this Wiener filter is the MMSE estimation matrix. The
estimation error is given by (2.33) with $B(\omega) = R_t(e^{j\omega})$.


Why  do  we  bother  with  this  filtering  perspective?  It  provides  us  with  a  compact  description 
of  the  estimator  (in  this  case  a  linear  filter)  and  a  means  of  constructing  it  that  does  not  rely  on 
multiplying  larger  and  larger  matrices.  An  additional  benefit  is  that  Wiener  filters  can  be  designed 
with  constraints  on  the  number  of  nonzero  taps.  In  a  sensor  network  scenario,  we  interpret  this 
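as a constraint on the number of sensors that can collaborate. Specifically, we can constrain the
estimation matrix $G$ to be 0 outside some set of diagonals:

$$
G = \begin{pmatrix}
g_{1,1} & g_{1,2} & \cdots & g_{1,K} & 0 & 0 & \cdots \\
g_{2,1} & g_{2,2} & \cdots & g_{2,K} & g_{2,K+1} & 0 & \cdots \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \\
g_{K,1} & g_{K,2} & \cdots & g_{K,K} & g_{K,K+1} & g_{K,K+2} & \cdots \\
0 & g_{K+1,2} & \cdots & g_{K+1,K} & g_{K+1,K+1} & g_{K+1,K+2} & \cdots \\
0 & 0 & \cdots & g_{K+2,K} & g_{K+2,K+1} & g_{K+2,K+2} & \cdots \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots
\end{pmatrix} \tag{2.48}
$$

Here we have constrained the matrix to operate on at most $K$ steps in either direction.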


The solution in this case is relatively simple [22, p. 102]. Define the vectors and matrices

$$ \mathbf{u}_y = (u[y - K], u[y - K + 1], \ldots, u[y + K])^T \tag{2.49} $$

$$ \mathbf{g}_y = (g[y - K], g[y - K + 1], \ldots, g[y + K])^T \tag{2.50} $$

$$ \mathbf{p}_y = E[\mathbf{u}_y \, t[y]] \tag{2.51} $$

$$ \mathbf{R}_y = E[\mathbf{u}_y \mathbf{u}_y^T] . \tag{2.52} $$

The solution is then:

$$ \mathbf{g}_y = \mathbf{R}_y^{-1} \mathbf{p}_y . \tag{2.53} $$

We follow it by the downsampling operation as before. Because all of the processes are wide-sense
stationary, this distributed solution can be computed offline.
The  previous  analysis  highlights  an  important  problem  in  trying  to  make  models  more  accurate. 
By  enforcing  a  realistic  collaboration  constraint,  we  appear  to  still  have  a  clean  and  easy-to-compute 
optimal  partially-decentralized  solution  for  the  estimation  problem.  Unfortunately,  real  sensors 
are  unlikely  to  be  positioned  evenly  on  a  line,  so  the  filter  will  not  be  space-invariant  and  the 
observations will not be wide-sense stationary by index.

2.1.4 Random matrices with iid entries


Let  us  now  suppose  that  the  entries  of  the  observation  matrix  are  independent  and  identically 
distributed  random  variables.  One  of  the  deeper  results  of  random  matrix  theory  is  that  the 
eigenvalue  distribution  of  large  random  matrices  is  the  same  regardless  of  the  distribution  chosen 
for  the  entries.  Therefore  the  results  we  can  obtain  for  random  matrix  models  of  sensor  observations 
rely  only  on  the  assumption  of  independent  entries.  The  geometric  view  of  the  network  here  is  that 
the  sources  and  sensors  are  both  located  in  the  same  area  but  a  priori  very  little  is  known  about  the 
coefficient  between  any  particular  source  and  sensor.  However,  once  the  network  is  deployed,  the 
relevant  coefficients  can  be  measured  or  approximated.  What  we  wish  to  know  is  the  probability 
of  achieving  a  certain  asymptotic  distortion. 


The primary tool we will use is the Marcenko-Pastur law [25] as it is cited in [2, pp. 620-621].
We reproduce it here for completeness:

Theorem 1. Suppose that $p/n \to y \in (0, \infty)$. If the entries of a $p \times n$ complex matrix $A$ are iid
with mean 0 and variance $\sigma^2$ and $B = \frac{1}{n} A A^H$, then the empirical spectral distribution (ESD) of $B$
tends (as $p, n \to \infty$) to a limiting distribution with density

$$
p_y(x) = \begin{cases}
\dfrac{1}{2\pi x y \sigma^2} \sqrt{(b - x)(x - a)} , & \text{if } a \le x \le b \\
0 , & \text{otherwise,}
\end{cases} \tag{2.54}
$$

and a point mass $1 - 1/y$ at the origin if $y > 1$, where $a = a(y) = \sigma^2 (1 - \sqrt{y})^2$ and $b = b(y) = \sigma^2 (1 + \sqrt{y})^2$.

The important feature of this distribution is that almost all of the nonzero normalized singular values
of $A$ will lie in a region bounded away from the origin, so unnormalized singular values scale to
infinity like $n$. This in turn will force the distortion to 0 for our problem.

Proposition 4. Suppose that $\{a_{ml}\}$ is an array of iid random variables with mean 0 and variance
$\sigma^2$. Then the asymptotic distortion converges to 0 as $M \to \infty$.

Proof. Because the singular values of $A$ and $A^T$ are identical, the distribution of the singular values
converges to the Marcenko-Pastur law with $y = L_0/M_0 < 1$. All of the eigenvalues in the asymptotic …

Indeed, the smallest eigenvalue converges to $a(y)$ almost surely [2, p. 635], so the convergence
is even stronger. This result is not surprising in view of the results of the first section, which
stated that bounded row and column sum variances are sufficient to force the distortion to a nonzero
value. The previous proposition has unrealistic assumptions for many practical systems:
the received energy is unbounded in expectation and the observation gains must be iid. However,
since the Marcenko-Pastur law is the limiting distribution for all distributions with zero mean and
variance $\sigma^2$, we can relax the condition on identical distribution. The sufficient condition on the
collection $\{a_{ml}\}$ of independent random variables with mean zero and common variance $\sigma^2$ is given
by [2, p. 623]. Suppose that for any $\delta > 0$,

$$ \frac{1}{\delta^2 L_n M_n} \sum_{l=1}^{L_n} \sum_{m=1}^{M_n} E\left[ |a_{ml}|^2 \, \mathbf{1}\left( |a_{ml}| \ge \delta \sqrt{M_n} \right) \right] \to 0 . \tag{2.55} $$

If this condition is satisfied, the singular values again converge to the Marcenko-Pastur law and
the distortion will again converge to 0. What these results suggest is that if the observations have
bounded energy, the distortion is finite and positive, but in the “average case” of unbounded energy,
the distortion will converge to 0.
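To get a feel for the Marcenko-Pastur density (2.54), the sketch below evaluates it for $y < 1$ and numerically checks two properties that follow from the theorem: the density integrates to 1 (no point mass when $y < 1$) and its mean is $\sigma^2$. The function names and the midpoint-rule integration are illustrative choices.

```python
import math


def mp_edges(y, var=1.0):
    """Support edges a(y) and b(y) of the Marcenko-Pastur density."""
    a = var * (1 - math.sqrt(y)) ** 2
    b = var * (1 + math.sqrt(y)) ** 2
    return a, b


def mp_density(x, y, var=1.0):
    """Density (2.54); zero outside the interval [a, b]."""
    a, b = mp_edges(y, var)
    if x <= a or x >= b:
        return 0.0
    return math.sqrt((b - x) * (x - a)) / (2 * math.pi * x * y * var)


def mp_moments(y, var=1.0, steps=20000):
    """Total mass and mean of the density via a midpoint rule on [a, b]."""
    a, b = mp_edges(y, var)
    h = (b - a) / steps
    mass = 0.0
    mean = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        d = mp_density(x, y, var)
        mass += d * h
        mean += x * d * h
    return mass, mean


print(mp_edges(0.5))    # lower edge is strictly positive when y < 1
print(mp_moments(0.5))  # mass close to 1, mean close to var
```

The strictly positive lower edge for $y < 1$ is the property used in Proposition 4: the normalized eigenvalues stay away from 0, so the unnormalized ones grow with the network.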


2.2  Estimation  with  fading  observations 

While estimation from known matrices is an interesting topic, a more accurate model of the
observations may come from the fading observations framework mentioned in the first chapter. Here
the matrix $A$ takes values in a set $\mathcal{A}$ and we must design an estimator that is robust across this
class. In slow fading, $A$ is chosen once and fixed for all time, and in fast fading $A$ is time-varying.
We will take up these two cases in turn. Slow fading turns out to be equivalent to the analysis in
the previous section because the matrix $A$ can (within reason) be estimated from the observations
themselves. The fast fading case is more difficult and we restrict our analysis to linear estimators.


2.2.1  Slow  fading 


Suppose that $\mathcal{A} = \{A_1, A_2, \ldots, A_K\}$ and that $A$ is chosen from $\mathcal{A}$. Let $\Delta(A_j | A_k)$ be the
mismatch error for a memoryless estimator for $A_j$ when $A = A_k$:

$$ \Delta(A_j | A_k) = E\left[ \left\| S - \sigma_S^2 A_j^T (\sigma_S^2 A_j A_j^T + \sigma_W^2 I)^{-1} (A_k S + W) \right\|^2 \right] . \tag{2.56} $$

We break the estimation into two parts: we first find $A$ from the sequence of observations
$U[1], U[2], \ldots, U[n]$ and then use the MMSE estimator assuming those are the true statistics. The
estimation problem can be seen as a hypothesis test between the different candidates in $\mathcal{A}$. For
simplicity, we will assume a uniform prior on the set $\mathcal{A}$.


A crucial property that we will need is that each $A_j \in \mathcal{A}$ induces a different joint distribution for
the observations $U$. As an example, suppose $+A \in \mathcal{A}$ and $-A \in \mathcal{A}$. Then the statistical properties
of the observations are identical under both hypotheses and there will be no way to tell them apart.
Consequently, an estimator built for $+A$ will make $\Delta(-A \,|\, +A)$ very large. In order to avoid these
complications, we will always assume that $\mathcal{A}$ is separable in the sense that $p(U | A_j) \ne p(U | A_k)$ for
$j \ne k$.


Suppose that $K = 2$ so that we have a binary hypothesis test. We can bound the probability
of error in our hypothesis test following [10, Sec. 3.4]. Let

$$ T_n = \frac{1}{n} \sum_{j=1}^{n} \log \frac{p_U(U[j] \,|\, A = A_1)}{p_U(U[j] \,|\, A = A_2)} \tag{2.57} $$

be the normalized observed log likelihood ratio. A well-known result states that the best estimate
$\hat{A}$ can be found by comparing $T_n$ to a threshold.

Lemma 1. In the slow fading case with discrete $\mathcal{A}$, if $A = A_k$ the distortion converges to the
MMSE distortion as the sample size goes to $\infty$:

$$ \lim_{n \to \infty} \Delta(\hat{A}(\{U[j] : j = 1, 2, \ldots, n\}) \,|\, A_k) = \Delta(A_k | A_k) . \tag{2.58} $$

Proof. Let $\alpha = P(\hat{A} \ne A_k)$. Then

$$ \Delta(\hat{A}(\{U[j] : j = 1, 2, \ldots, n\}) \,|\, A_k) = (1 - \alpha) \Delta(A_k | A_k) + \alpha \Delta(A_j | A_k) . \tag{2.59} $$

By Corollary 3.4.6 in [10], the probability of error satisfies a large deviations bound so that $\alpha \le \exp(-n\beta)$ for some constant $\beta > 0$. Since $\Delta(A_j | A_k)$ is bounded, taking the limit on both sides
completes the proof. □
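The two-step procedure (identify $A$, then estimate) can be sketched for the simplest case of one source and one sensor, where each hypothesis is a scalar gain and $U$ is zero-mean Gaussian with variance $a^2\sigma_S^2 + \sigma_W^2$. The gains 1 and 3, the sample size, the zero threshold (uniform prior) and all function names below are assumptions made for the illustration.

```python
import math
import random


def log_likelihood_ratio(u, v1, v2):
    """log of N(u; 0, v1) / N(u; 0, v2) for zero-mean Gaussian densities."""
    return 0.5 * math.log(v2 / v1) - 0.5 * u * u * (1.0 / v1 - 1.0 / v2)


def decide(observations, a1, a2, var_s, var_w):
    """Return 1 or 2 by comparing the normalized statistic T_n of (2.57) to 0."""
    v1 = a1 * a1 * var_s + var_w
    v2 = a2 * a2 * var_s + var_w
    t_n = sum(log_likelihood_ratio(u, v1, v2) for u in observations) / len(observations)
    return 1 if t_n > 0 else 2


def simulate(a_true, n, var_s, var_w, seed):
    """Draw n scalar observations U = a_true * S + W with a fixed seed."""
    rng = random.Random(seed)
    return [a_true * rng.gauss(0.0, math.sqrt(var_s)) + rng.gauss(0.0, math.sqrt(var_w))
            for _ in range(n)]


print(decide(simulate(1.0, 2000, 1.0, 1.0, seed=0), 1.0, 3.0, 1.0, 1.0))
print(decide(simulate(3.0, 2000, 1.0, 1.0, seed=1), 1.0, 3.0, 1.0, 1.0))
```

Note that the statistic depends on each gain only through its square, which is the scalar version of the separability issue with $+A$ and $-A$ raised above.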

This clearly extends to finite $K$. For centralized estimators, this type of slow fading is uninteresting
because it can be disambiguated with exponentially small probability of error. Let us now
consider the case where $\mathcal{A}$ is not finite, so that a simple hypothesis test may no longer be sufficient.
A simple approach is to compute the sample covariance matrix:

$$ \hat{\Sigma}_U = \frac{1}{n} \sum_{j=1}^{n} U[j] U[j]^T . \tag{2.60} $$

Unfortunately, the sample covariance is very sensitive to outliers in the data [1]. Several methods
for robust covariance estimation have been proposed in the statistics literature [1], [5], [26], [40],
and depending on the nature of the set $\mathcal{A}$, some will be better than others. For example, if $\mathcal{A}$ is very
structured and constrained, the EM approach of [5] may be effective. The worst-case distortion
can then be expressed as:

$$ D_{\max} = \lim_{n \to \infty} \sup_{B \in \mathcal{A}} D(\hat{A}(\{U[j] : j = 1, 2, \ldots, n\}) \,|\, B) . \tag{2.61} $$

Although slow fading for centralized estimators seems straightforward, estimating $A$ from the
marginals at each sensor may prove to be impossible, as we will see in Chapter 4. Similarly, if
the sensors are limited in their ability to communicate with each other, computationally intensive
covariance estimation procedures that require significant inter-sensor communication may be
infeasible.

2.2.2 Fast fading


We now turn to fast fading, in which the observation matrix $A[n]$ varies over time. We will
assume that $A[n]$ is iid with some distribution $p_A(\cdot)$ on a finite set $\mathcal{A} = \{A_1, A_2, \ldots, A_K\}$. If $A[n]$
were known to the estimator, it would simply use the MMSE estimator for each $A_k$:

$$ D_{\mathrm{opt}} = \sum_{k=1}^{K} p_A(A_k) \Delta(A_k | A_k) . \tag{2.62} $$

However, our interest is in the case where $A[n]$ is unknown.

Suppose first that $p_A(\cdot)$ is not known to the estimator. Following the previous section, can we
estimate $p_A(\cdot)$? The answer is yes, as long as $K$ is not too large. Consider the average covariance
matrix of the observations:

$$ \Sigma_U = \sum_{k=1}^{K} p_A(A_k) (\sigma_S^2 A_k A_k^T + \sigma_W^2 I) . \tag{2.63} $$

The sample covariance matrix of the observation vectors will converge to $\Sigma_U$, with the same caveats
about outliers as before. Alternatively, we can use the robust methods mentioned earlier to estimate
$\Sigma_U$. Since $\Sigma_U$ is just a linear matrix-valued function with coefficients $\{p_A(A_k)\}$, we can estimate
$\{p_A(A_k)\}$ from $\Sigma_U$ as long as $K \le (M^2 + M)/2$, the dimension of the set of possible covariance
matrices. Clearly there are many convergence as well as numerical stability issues in performing
this estimation.

Let us instead look at the case where we are forced to choose a fixed memoryless estimation
matrix $G$. For a particular distribution $p_A(\cdot)$ and $G$ we have an optimization problem over the
error functions

$$ D(p_A, G) = \frac{1}{L} E\left[ (S - \hat{S})^T (S - \hat{S}) \right] = \frac{1}{L} E\left[ (S - G(AS + W))^T (S - G(AS + W)) \right] , \tag{2.64} $$

where the expectation is taken over $A$, $S$, and $W$.

We can view this as a game in which one player chooses a matrix $G$ to minimize the distortion
and the other chooses a distribution $p_A$ to maximize the distortion. Suppose that $p_A$ is known. Then
the worst case distortion for this estimator is given by the solution to the following optimization
problem:

$$ \sup_{p_A} \inf_{G} D(p_A, G) . \tag{2.65} $$

In the case where $p_A$ is unknown, the first player must choose $G$ and reveal that choice to the
second player. The relevant quantity is

$$ \inf_{G} \sup_{p_A} D(p_A, G) . \tag{2.66} $$

The interpretation of this is that for each choice of $G$ there is a “worst-case” distribution $p_A$, and
that we will choose the $G$ that induces the smallest average distortion for the worst-case $p_A$.

Consider  first  the  case  where  A  is  chosen  iid  across  time  according  to  a  distribution  pA  that  is 
known  to  the  estimator,  leading  to  (2.65).  Then 

1 


D(pA,  G)  —  —  tr  [<7g/  —  2a‘gGA  +  G(a‘gTiA  +  <Jw)GT~\ 

=  ~jr  (Lag  —  2a‘g  tr (GA)  +  a'g  tr (Gt GYi a)  +  <7^  tr(GTG)'j  , 


(2.67) 


where  A  =  E[A\  and  S a  =  E[AAT] .  We  need  to  minimize  this  over  G.  Let  us  hrst  consider  the 
scalar  case  where  L  =  M  =  1. 


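$$\frac{\partial}{\partial G} D(p_A, G) = -2\sigma_S^2 \bar{A} + 2\sigma_S^2 \Sigma_A G + 2\sigma_W^2 G \tag{2.68}$$

$$G = \frac{\bar{A}\sigma_S^2}{\Sigma_A \sigma_S^2 + \sigma_W^2} \tag{2.69}$$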
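As a quick numerical check of (2.68) and (2.69), the following Python sketch evaluates the scalar distortion and the gain that zeroes its derivative. It assumes that $A$ takes finitely many values with known probabilities and that $\Sigma_A$ is the second moment $E[A^2]$; the function names are ours and purely illustrative.

```python
import numpy as np

def moments(a_vals, probs):
    """Mean E[A] and second moment E[A^2] of a finitely supported scalar A."""
    a = np.asarray(a_vals, dtype=float)
    p = np.asarray(probs, dtype=float)
    return float(p @ a), float(p @ a**2)

def scalar_distortion(G, a_vals, probs, var_s=1.0, var_w=1.0):
    """Scalar (L = M = 1) distortion E[(S - G(AS + W))^2]."""
    a_bar, sigma_a = moments(a_vals, probs)
    return var_s - 2.0 * var_s * a_bar * G + G**2 * (var_s * sigma_a + var_w)

def scalar_gain(a_vals, probs, var_s=1.0, var_w=1.0):
    """Gain that sets the derivative of the scalar distortion to zero."""
    a_bar, sigma_a = moments(a_vals, probs)
    return a_bar * var_s / (sigma_a * var_s + var_w)
```

For example, if $A$ is equally likely to be 1 or 2 and both variances are 1, then $\bar{A} = 1.5$ and $\Sigma_A = 2.5$, so the gain is $1.5/3.5 = 3/7$ and the distortion is $1 - 2.25/3.5 = 5/14$.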

Unfortunately, this analysis does not easily extend to the matrix case. The problem is that $\bar{A}\bar{A}^T \neq \Sigma_A$
in general, so the trick of taking the singular value decomposition used before is no longer valid.

Suppose  we  take  the  orthogonality  condition  for  least-squares  estimation: 


$$\operatorname{tr}\left(E\left[(S - GU) U^T G^T\right]\right) = 0. \tag{2.70}$$


This gives:

$$\operatorname{tr}(\sigma_S^2 \bar{A}^T G^T) = \operatorname{tr}\left(G(\sigma_S^2 \Sigma_A + \sigma_W^2 I) G^T\right). \tag{2.71}$$

If we just "guess"

$$G = \sigma_S^2 \bar{A}^T (\sigma_S^2 \Sigma_A + \sigma_W^2 I)^{-1}, \tag{2.72}$$

then (perhaps unsurprisingly) we get equality in (2.71). Thus the best estimator is given by (2.72).
Our estimation error is therefore given by

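$$\inf_G D(p_A, G) = \frac{1}{L}\left(L\sigma_S^2 - 2\sigma_S^4 \operatorname{tr}\left(\bar{A}^T(\sigma_S^2 \Sigma_A + \sigma_W^2 I)^{-1}\bar{A}\right) + \sigma_S^4 \operatorname{tr}\left(\bar{A}^T(\sigma_S^2 \Sigma_A + \sigma_W^2 I)^{-1}\bar{A}\right)\right) \tag{2.73}$$

$$= \sigma_S^2 - \frac{\sigma_S^4}{L} \operatorname{tr}\left(\bar{A}^T(\sigma_S^2 \Sigma_A + \sigma_W^2 I)^{-1}\bar{A}\right). \tag{2.74}$$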

We can immediately deduce some interesting consequences of this result. Firstly, if the convex
closure of $\mathcal{A}$ contains the 0 matrix, the distortion can be forced to $\sigma_S^2$ per source component
($L\sigma_S^2$ in total) by choosing the $p_A$ that makes $\bar{A} = 0$. Secondly, since our estimator is only a
function of the statistics of the $A$ process, it can be constructed "on the fly" based on the sensor observations.
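The estimator (2.72) and the resulting distortion (2.74) can be checked numerically. The sketch below is only an illustration under stated assumptions: $A$ is drawn from a finite list of $M \times L$ matrices with known probabilities, the source and noise are white with variances `var_s` and `var_w`, and all function names are ours.

```python
import numpy as np

def linear_estimator(A_list, probs, var_s=1.0, var_w=1.0):
    """Return G from (2.72) together with A_bar = E[A] and Sigma_A = E[A A^T]."""
    A_bar = sum(p * A for p, A in zip(probs, A_list))          # M x L
    Sigma_A = sum(p * A @ A.T for p, A in zip(probs, A_list))  # M x M
    K = var_s * Sigma_A + var_w * np.eye(A_bar.shape[0])
    G = var_s * A_bar.T @ np.linalg.inv(K)                     # L x M
    return G, A_bar, Sigma_A

def distortion(G, A_bar, Sigma_A, var_s=1.0, var_w=1.0):
    """Distortion (2.67) of an arbitrary linear estimator G."""
    L, M = G.shape
    K = var_s * Sigma_A + var_w * np.eye(M)
    return (L * var_s - 2.0 * var_s * np.trace(G @ A_bar)
            + np.trace(G @ K @ G.T)) / L

def min_distortion(A_bar, Sigma_A, var_s=1.0, var_w=1.0):
    """Closed-form minimum (2.74)."""
    M, L = A_bar.shape
    K = var_s * Sigma_A + var_w * np.eye(M)
    return var_s - var_s**2 / L * np.trace(A_bar.T @ np.linalg.solve(K, A_bar))
```

Since the distortion is a convex quadratic in $G$, any perturbation of the gain returned by `linear_estimator` should increase the value of `distortion`.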

We now turn to the reversed scenario, where a linear estimator must be chosen offline, and
then the worst-case $p_A$ is selected, leading to (2.66). Let us first consider the scalar version of the
problem. The distortion function is given by (2.67):

$$D(p_A, G) = E\left[S^2 - 2GAS^2 + G^2(A^2 S^2 + W^2)\right] \tag{2.75}$$

$$= \sigma_S^2 - 2G\mu_A\sigma_S^2 + G^2(\sigma_A^2\sigma_S^2 + \sigma_W^2), \tag{2.76}$$

where $\mu_A$ and $\sigma_A^2$ are the mean and second moment $E[A^2]$ of $A$. Note that this distortion only depends on these
parameters of $A$. For any choice of $G$, we can evaluate $D(A_i, G)$ for every choice of $A_i$. The worst-case
$p_A$ will concentrate all its mass on the $A_i$'s that maximize the distortion.

The vector case is no different: for any linear estimator $G$, one or more of the possible observation
matrix values $A_i \in \mathcal{A}$ will maximize the distortion. Thus the set of possible $G$'s is partitioned by
this "worst-case" function. Let $\Pi_j = \{G : D(A_j, G) = \max_i D(A_i, G)\}$. We can choose the best $G$
via the following:

$$\inf_G \sup_{p_A} D(p_A, G) = \min\left\{ \inf_{G \in \Pi_j} D(A_j, G) : A_j \in \mathcal{A} \right\}. \tag{2.77}$$

Unfortunately, we cannot at this time come up with a nice characterization of this problem under
general conditions, so we will close with a scalar example.

Example: spike source


Let us take the pathological example for which $\mathcal{A} = \{1, 1000\}$ and $\sigma_S^2 = \sigma_W^2 = 1$. We partition
the set of possible estimators $G$ into those for which $A = 1$ is worst and those for which $A = 1000$
is worst. The dividing point will be when they are equal:


$$\sigma_S^2 - 2G\sigma_S^2 + G^2(\sigma_S^2 + \sigma_W^2) = \sigma_S^2 - 2000G\sigma_S^2 + G^2(10^6\sigma_S^2 + \sigma_W^2) \tag{2.78}$$

$$G = \frac{2(10^3 - 1)}{10^6 - 1} \approx 1.998 \times 10^{-3}. \tag{2.79}$$

So choosing $G$ to be at this dividing point will lead to the lowest worst-case distortion:

$$D = \sigma_S^2 - 2\sigma_S^2 G + (\sigma_S^2 + \sigma_W^2)G^2 \approx 0.996, \tag{2.80}$$


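which is not much better than "guessing." A slightly less pathological example may be to take
$\mathcal{A} = \{1, 2\}$, for which the dividing point is

$$G = \frac{2(2 - 1)}{4 - 1} = \frac{2}{3}, \tag{2.81}$$

$$D \approx 0.556. \tag{2.82}$$

For $A = 1$ the optimal distortion is 0.5, so the distortion at the dividing point is a little over 11% larger.
Note, however, that here $A = 1$ is the worst case for every $G \in (0, 2/3)$, and this interval contains
$G = 1/2$, the optimal gain for $A = 1$. The worst-case distortion is therefore minimized at $G = 1/2$ rather
than at the dividing point, and it equals 0.5.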
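The scalar minimax problem can also be explored by brute force. The sketch below evaluates the distortion curve $D(A_i, G) = \sigma_S^2 - 2GA_i\sigma_S^2 + G^2(A_i^2\sigma_S^2 + \sigma_W^2)$ for each value in a finite set and picks the gain on a grid that minimizes the largest curve. The grid and the function names are assumptions made for illustration; the accuracy is limited by the grid spacing.

```python
import numpy as np

def distortion_curves(gains, a_set, var_s=1.0, var_w=1.0):
    """Rows: values of A in a_set; columns: candidate scalar gains G."""
    a = np.asarray(a_set, dtype=float)[:, None]
    g = np.asarray(gains, dtype=float)[None, :]
    return var_s - 2.0 * g * a * var_s + g**2 * (a**2 * var_s + var_w)

def minimax_gain(gains, a_set, var_s=1.0, var_w=1.0):
    """Grid point with the smallest worst-case distortion over a_set."""
    gains = np.asarray(gains, dtype=float)
    worst = distortion_curves(gains, a_set, var_s, var_w).max(axis=0)
    k = int(worst.argmin())
    return float(gains[k]), float(worst[k])
```

For the set $\{1, 1000\}$ with unit variances, a grid of spacing $10^{-7}$ on $[0, 0.01]$ returns a gain close to $2/1001 \approx 1.998 \times 10^{-3}$ and a worst-case distortion of about 0.996, in agreement with (2.79) and (2.80).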


2.3  Our  toy  example 


We  now  return  to  the  canonical  example  with  which  we  ended  the  previous  chapter.  The  first 
step in our analysis will be to find the spectrum of the observation matrix:

$$A^T A = \begin{bmatrix} B_1 & 0 \\ 0 & B_2 \end{bmatrix}, \tag{2.83}$$

where $B_i$ is the number of sensors observing source $i$. A centralized estimator could compute these
numbers approximately by taking a very large sample covariance matrix. The best linear estimator
based on knowledge of the $B_i$ would be given by (2.62)


$$D = \frac{1}{2}\,\frac{\sigma_S^2 \sigma_W^2}{B_1 \sigma_S^2 + \sigma_W^2} + \frac{1}{2}\,\frac{\sigma_S^2 \sigma_W^2}{B_2 \sigma_S^2 + \sigma_W^2}. \tag{2.84}$$

However, this presupposes that the estimator can determine the true matrix $A$. To do this it
needs significantly more than just the numbers $B_1$ and $B_2$. A naive and computationally crippling
procedure could allow the estimator to perform pairwise tests to divide the sensors into two
groups. The error in these tests is exponentially small in the number of samples, so the sorting
would be accurate for any finite $M$. Since the focus of this thesis is not on computational feasibility,
we leave this as an open problem and assume that the true matrix $A$ can indeed be estimated.

The asymptotics of this problem are relatively uninteresting: if each sensor is equally likely to
observe $S_1$ as $S_2$, then both $B_1$ and $B_2$ will converge to $M/2$ as $M \to \infty$. In scaling law parlance
we would say that the total distortion goes to 0 as $1/M$. The reason for this bland analysis is
that the estimator is both data-dependent and centralized. As we shall see, the problem becomes
significantly more complicated once distributed computation and communication enter into the
picture.
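The $1/M$ scaling can be seen directly from the distortion expression. The sketch below assumes that (2.84) is the average over the two sources of the per-source error $\sigma_S^2\sigma_W^2/(B_i\sigma_S^2 + \sigma_W^2)$; this reading of the formula and the function name are our assumptions.

```python
def centralized_distortion(B1, B2, var_s=1.0, var_w=1.0):
    """Average of the two per-source estimation errors when B1 and B2 are known."""
    def per_source(B):
        return var_s * var_w / (B * var_s + var_w)
    return 0.5 * (per_source(B1) + per_source(B2))
```

With $B_1 = B_2 = M/2$ the product $M \cdot D$ approaches $2\sigma_W^2$ as $M$ grows, so under this assumption the distortion decays like $1/M$.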
Chapter 3


Rondo:  source  coding  with 
uncertainty 


In  this  chapter  we  will  look  at  lossless  distributed  source  reconstruction  using  the  tools  of 
information  theory.  Observation  uncertainty  for  this  problem  takes  the  form  of  an  unknown  joint 
distribution  for  the  sources.  In  what  follows  we  assume  the  reader  is  familiar  with  the  basics 
of  information  theory  as  described  in  [6],  for  example.  In  the  next  section  we  will  describe  the 
Slepian-Wolf  problem  and  its  extension  to  uncertain  joint  distributions.  This  does  not  provide  any 
difficulty  in  the  proof  -  any  rates  that  are  achievable  for  all  sources  in  the  class  are  achievable 
using  a  modified  Slepian-Wolf  code.  The  main  contribution  of  this  chapter  comes  in  section  5, 
where we look at a source coding system in which the blocklength and number of sensors increase
simultaneously.  By  linking  the  two  we  can  bound  how  fast  the  blocklength  must  grow  in  order  to 
accommodate  more  sensors  for  a  fixed  error  probability. 

Information theory views the problem as one of encoding the sensor observations into a discrete
set of indices. Given a sequence of source observations $U^n$, we create an encoding map $\mathcal{U}^n \to \{1, 2, \ldots, N\}$
and a decoding map $\{1, 2, \ldots, N\} \to \hat{\mathcal{S}}^n$ in order to minimize some expected distortion
$E[d(s^n, \hat{s}^n)]$. In the noiseless case, the sensor observations $U^n$ are simply the source values
$S^n$. For lossless source coding, the distortion measure is error probability, so that $\hat{\mathcal{S}} = \mathcal{S}$ and

$$d(s^n, \hat{s}^n) = \frac{1}{n} \sum_{i=1}^{n} 1(s_i \neq \hat{s}_i). \tag{3.1}$$

The objective is to create a sequence of codes, indexed by $n$, such that $P_e = E[d(s^n, \hat{s}^n)] \to 0$ as
$n \to \infty$. Thus in the limit, the reconstruction is "perfect." In order to do this, we will upper bound
the block error $1(s^n \neq \hat{s}^n)$.

3.1  Slepian-Wolf  coding  over  a  class  of  distributions 

In  this  section  we  will  discuss  observation  uncertainty  in  lossless  source  coding  problems.  The 
type  of  observation  uncertainty  that  we  will  address  is  that  of  distributional  uncertainty.  In  the 
point-to-point  case  this  leads  to  the  well-addressed  problem  of  constructing  universal  source  codes. 
In  the  multi-terminal  case  the  famous  theorem  of  Slepian  and  Wolf  provides  the  springboard  for 
coding  schemes  that  can  handle  multiple  distributions.  The  idea  is  shown  in  Figure  3.1.  We  will 
look  at  how  the  number  of  terminals  and  hence  modeling  complexity  relate  to  the  blocklength, 
which  will  motivate  our  analysis  in  the  next  section. 

The source-coding community has already addressed uncertainty in the underlying source distribution.
For the point-to-point or centralized problem, many universal source codes exist for discrete
$S$, using the Lempel-Ziv algorithm [42], context-tree weighting [41], or the Burrows-Wheeler Transform
[4], [15], for example. These source codes will compress the source $S$ at a rate that converges
to the true entropy $H(S)$ without knowing a priori what that entropy is. In one very simple universal
source code, the encoder observes a block of $n$ symbols $s^n$ and transmits the type $T(s^n)$ of
the block along with a compressed version using a compression algorithm that assumes the block is
distributed according to $T(s^n)$. The overhead for transmitting the type is negligible compared to
the blocklength, and the rate required by this scheme also converges to $H(S)$. This universal scheme
is more in the spirit of the multi-terminal problem we will investigate next.
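The two-part scheme just described can be sketched as follows. The code computes the type of a block and an upper bound on the rate of a code that first describes the type and then indexes the block within its type class. It uses the standard bounds that there are at most $(n+1)^{|\mathcal{S}|}$ types and that a type class has at most $2^{nH(T)}$ members; the function names are ours, and integer rounding of codeword lengths is ignored.

```python
import math
from collections import Counter

def empirical_type(block):
    """Empirical distribution (type) of a sequence of symbols."""
    n = len(block)
    return {sym: count / n for sym, count in Counter(block).items()}

def entropy_bits(dist):
    """Entropy in bits of a distribution given as {symbol: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def two_part_rate(block, alphabet_size):
    """Upper bound on bits per symbol: type description plus index in the type class."""
    n = len(block)
    type_bits = alphabet_size * math.log2(n + 1)
    index_bits = n * entropy_bits(empirical_type(block))
    return (type_bits + index_bits) / n
```

For a binary block of length 3000, the type description costs at most $2\log_2(3001) \approx 23$ bits, which is under 0.01 bits per symbol.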

Unfortunately, universal schemes are difficult to extend to multiuser settings. Consider the
problem of two terminals observing correlated sources $S_1$ and $S_2$. If the statistics are known to the
encoders and decoder, then the set of rates at which the sources can be communicated losslessly to
a destination was found by Slepian and Wolf [34]. An exercise in Csiszár and Körner [8, Exercise
3.1.6] (more generally in [7, Theorem 2]) gives an example of how to construct a single code for all
correlated sources that achieves a certain error exponent. To obtain that exponent, the encoder is
obligated to raise the rate at which it operates to that of a worst-case source. This is in contrast
to the point-to-point schemes, where the coding algorithms converge to the minimum rate needed
for the particular source.

Figure 3.1. Source coding for a class of sources. The joint distribution $P_\lambda(S_1, S_2)$ is indexed by $\lambda \in \Lambda$.

Another coding scheme that can achieve these worst-case points is the sequential binning scheme
proposed by Draper, Chang and Sahai [13]. In their framework, unknown statistics are not a
problem as long as the target rate pair is in the achievable region for the source, and they provide
error exponents for their scheme that in some cases are better than those for the corresponding
block codes. Baron, Khojastepour, and Baraniuk [3] examined the notion of redundancy rates for
fixed block-length coding, which captures the excess rate needed to account for distributed coding
in the non-asymptotic regime. However, in order to analyze the universality of their scheme over
different distributions, they adopt the linked-encoder framework of Oohama [27]. Other strategies
to gain universality [12], [24] propose a feedback link from the decoder.

However, in some instances the encoder and decoder may have some limited knowledge of the
joint distribution of the sources, and a code in the style of Csiszár and Körner is more reasonable.
We will show that it is possible to lower the rates required to those of the worst source in the class. The code
in [7] is non-constructive and uses a minimum entropy decoder. Here we give an explicit binning
construction in the style of [6] using a decoder that looks for jointly typical sequences. For the sake
of completeness, we include some standard definitions first.

Definition 1. A discrete memoryless correlated source is a tuple of random variables $S = (S_1, S_2, \ldots, S_m)$
with variable $S_j$ taking values in a finite set $\mathcal{S}_j$. The variables are jointly distributed with
some distribution $P(S)$ independently and identically across time. A class of sources is a collection
of joint distributions $P_\lambda(S)$ indexed by $\lambda \in \Lambda$. Entropies calculated under $P_\lambda$ are also given subscript
$\lambda$, e.g. $H_\lambda(S_1, S_2)$, $H_\lambda(S_1 | S_2)$, etc.

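Definition 2. An $(n, R_1, R_2)$ distributed source code for a class of sources $(S, T)$ is a tuple $(\phi_1, \phi_2, \psi)$
of maps with

$$\phi_1 : \mathcal{S}^n \to \{1, 2, \ldots, 2^{nR_1}\} \tag{3.2}$$

$$\phi_2 : \mathcal{T}^n \to \{1, 2, \ldots, 2^{nR_2}\} \tag{3.3}$$

$$\psi : \{1, 2, \ldots, 2^{nR_1}\} \times \{1, 2, \ldots, 2^{nR_2}\} \to \mathcal{S}^n \times \mathcal{T}^n. \tag{3.4}$$

The probability of error for this code under distribution $P_\lambda(S, T)$ for $\lambda \in \Lambda$ is

$$P_e^\lambda = P_\lambda\left(\psi(\phi_1(S^n), \phi_2(T^n)) \neq (S^n, T^n)\right). \tag{3.5}$$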

Definition 3. A rate pair $(R_1, R_2)$ is achievable for a class of sources $\{P_\lambda : \lambda \in \Lambda\}$ if there exists
a sequence of $(n, R_1, R_2)$ distributed source codes such that $P_e^\lambda \to 0$ for all $\lambda \in \Lambda$. The achievable
rate region is the closure of the set of achievable rates.


Given any Slepian-Wolf rate region $\mathcal{R}$, we can show that all sources whose rate regions lie within
$\mathcal{R}$ are achievable using a single code.

Proposition 5. Let $a_1$, $a_2$, and $a_3$ be positive real constants. Consider the class of sources
$\{P_\lambda : \lambda \in \Lambda\}$ for the random variables $(S_1, S_2)$ that satisfy:

$$a_1 \geq H_\lambda(S_1 | S_2) \tag{3.6}$$

$$a_2 \geq H_\lambda(S_2 | S_1) \tag{3.7}$$

$$a_3 \geq H_\lambda(S_1, S_2). \tag{3.8}$$

Then the set of achievable rates is given by $\{(R_1, R_2) : R_1 \geq a_1,\ R_2 \geq a_2,\ R_1 + R_2 \geq a_3\}$.

Proof. The converse is simple. Fix $\epsilon > 0$ and suppose rate $R_1 = a_1 - \epsilon$ was in the achievable rate
region. Then it must be in the rate region for all sources in the class. But there exists some source
in the class such that $H(S_1 | S_2) = a_1 - \epsilon/2$, so $R_1$ is not an achievable rate for this source, which
is a contradiction. Identical arguments can be made for the other inequalities.

The  proof  of  achievability  is  nearly  identical  to  that  in  [6,  §14.4.1],  with  a  slight  modification. 
The  key  fact  is  that  for  a  fixed  block  length  n,  there  are  only  a  polynomial  number  of  types,  so 
the  number  of  typical  sequences  over  all  sources  in  the  class  is  dominated  by  the  source  with  the 
largest  entropy  in  the  class.  The  rate  penalty  for  using  polynomially  more  bins  in  the  Slepian-Wolf 
code  goes  to  zero  with  the  blocklength. 

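Let $(R_1, R_2)$ be an achievable rate pair. We will show that the pair $(R_1 + \delta, R_2 + \delta)$ is achievable
using binning for any $\delta > 0$. Fix a block length $n$.

1. Assign to each sequence $s_1^n \in \mathcal{S}_1^n$ an index in $\{1, 2, \ldots, 2^{n\delta} \cdot 2^{nR_1}\}$, chosen uniformly. Assign to
each sequence $s_2^n \in \mathcal{S}_2^n$ an index in $\{1, 2, \ldots, 2^{n\delta} \cdot 2^{nR_2}\}$, also chosen uniformly. These are our
encoders $\phi_1$ and $\phi_2$.

2. The messages that the users send are the bin indices of their respective source sequences. Let
the messages be $m_1$ and $m_2$.

3. Decode $(\hat{s}_1^n, \hat{s}_2^n)$ if $\phi_1(\hat{s}_1^n) = m_1$, $\phi_2(\hat{s}_2^n) = m_2$, and $(\hat{s}_1^n, \hat{s}_2^n) \in A_\epsilon^{(n)}(\lambda)$ for some $\lambda \in \Lambda$, where $A_\epsilon^{(n)}(\lambda)$ denotes the $\epsilon$-typical set under $P_\lambda$.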

We must now bound the probability of error. For a fixed $P_\lambda$, the coding scheme above has
sufficient rate by the Slepian-Wolf theorem. The new error events center around what happens
when a pair that is jointly typical with respect to $P_\lambda$ is instead decoded as a pair that is
jointly typical with respect to $P_\mu$. In what follows, we assume that $(S_1^n, S_2^n)$ is chosen according to
$P_\lambda$ and that $\lambda \neq \mu$. Writing $A_\epsilon^{(n)}(\mu)$ for the $\epsilon$-typical set under $P_\mu$, the errors are then:

1. There is an $\hat{s}_1^n$ such that $\phi_1(\hat{s}_1^n) = \phi_1(S_1^n)$ and $(\hat{s}_1^n, S_2^n) \in A_\epsilon^{(n)}(\mu)$ for some $\mu$,

2. There is an $\hat{s}_2^n$ such that $\phi_2(\hat{s}_2^n) = \phi_2(S_2^n)$ and $(S_1^n, \hat{s}_2^n) \in A_\epsilon^{(n)}(\mu)$ for some $\mu$,

3. There is a pair $(\hat{s}_1^n, \hat{s}_2^n) \neq (S_1^n, S_2^n)$ such that $\phi_1(\hat{s}_1^n) = \phi_1(S_1^n)$, $\phi_2(\hat{s}_2^n) = \phi_2(S_2^n)$, and $(\hat{s}_1^n, \hat{s}_2^n) \in A_\epsilon^{(n)}(\mu)$ for some $\mu$.


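To bound these events 1-3 we must turn to our type arguments. Let $\mathcal{P}_n$ denote all types
of denominator $n$. There are at most $(n + 1)^{|\mathcal{S}_1| + |\mathcal{S}_2|}$ such types. The set of jointly typical
sequences across all classes is then

$$\bigcup_{\mu \in \mathcal{P}_n \cap \Lambda} A_\epsilon^{(n)}(\mu).$$

The size of the jointly typical sets is then:

$$\left| \bigcup_{\mu \in \mathcal{P}_n \cap \Lambda} A_\epsilon^{(n)}(\mu) \right| \leq \sum_{\mu \in \mathcal{P}_n \cap \Lambda} 2^{n(H_\mu(S_1, S_2) + \epsilon)} \tag{3.9}$$

$$\leq (n + 1)^{|\mathcal{S}_1| + |\mathcal{S}_2|} \left( \sup_{\mu \in \mathcal{P}_n \cap \Lambda} 2^{n(H_\mu(S_1, S_2) + \epsilon)} \right) \tag{3.10}$$

$$\leq \sup_{\mu \in \mathcal{P}_n \cap \Lambda} 2^{n\left(H_\mu(S_1, S_2) + \epsilon + n^{-1}(|\mathcal{S}_1| + |\mathcal{S}_2|)\log(n + 1)\right)} \tag{3.11}$$

$$\leq \sup_{\mu \in \mathcal{P}_n \cap \Lambda} 2^{n(H_\mu(S_1, S_2) + \epsilon + \delta_n)}, \tag{3.12}$$

where $\delta_n \to 0$ as $n \to \infty$. Similarly,

$$|A_\epsilon^{(n)}(S_1 | S_2)| \leq \sup_{\mu \in \mathcal{P}_n \cap \Lambda} 2^{n(H_\mu(S_1 | S_2) + \epsilon + \delta_n)}, \qquad |A_\epsilon^{(n)}(S_2 | S_1)| \leq \sup_{\mu \in \mathcal{P}_n \cap \Lambda} 2^{n(H_\mu(S_2 | S_1) + \epsilon + \delta_n)}. \tag{3.13}$$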
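The reason the overhead term vanishes is that the number of types is only polynomial in $n$. The following sketch counts types exactly and evaluates the term $n^{-1}(|\mathcal{S}_1| + |\mathcal{S}_2|)\log(n+1)$ from the chain of inequalities above; logarithms are taken base 2 and the function names are ours.

```python
import math

def num_types(n, k):
    """Exact number of types of denominator n on an alphabet of size k."""
    return math.comb(n + k - 1, k - 1)

def delta_n(n, size_s1, size_s2):
    """Rate overhead in bits per symbol, using the exponent stated in the text."""
    return (size_s1 + size_s2) * math.log2(n + 1) / n
```

For binary sources, `delta_n(10**6, 2, 2)` is below $10^{-4}$ bits per symbol, while the exact count `num_types(n, k)` never exceeds $(n+1)^k$.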


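For any fixed $\lambda$, the probabilities of the decoding errors above are:

$$P_{e,1} \leq \sum_{(s_1^n, s_2^n)} P_\lambda(s_1^n, s_2^n) \sum_{\mu \in \mathcal{P}_n \cap \Lambda} \ \sum_{(\hat{s}_1^n, s_2^n) \in A_\epsilon^{(n)}(\mu)} P\left(\phi_1(\hat{s}_1^n) = \phi_1(s_1^n)\right) \leq 2^{-nR_1} \sup_{\mu \in \mathcal{P}_n \cap \Lambda} 2^{n(H_\mu(S_1 | S_2) + \epsilon + \delta_n)}$$

$$P_{e,2} \leq 2^{-nR_2} \sup_{\mu \in \mathcal{P}_n \cap \Lambda} 2^{n(H_\mu(S_2 | S_1) + \epsilon + \delta_n)}$$

$$P_{e,3} \leq 2^{-n(R_1 + R_2)} \sup_{\mu \in \mathcal{P}_n \cap \Lambda} 2^{n(H_\mu(S_1, S_2) + \epsilon + \delta_n)}$$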
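To get a feel for these bounds, the sketch below evaluates them for a finite class of joint distributions, each given as a matrix of probabilities. Replacing the supremum over $\mathcal{P}_n \cap \Lambda$ by a maximum over a finite list is our simplifying assumption, as are the function names; logarithms are base 2.

```python
import math
import numpy as np

def entropies(P):
    """Return H(S1|S2), H(S2|S1), H(S1,S2) in bits for a joint pmf matrix P."""
    P = np.asarray(P, dtype=float)
    def H(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())
    h12 = H(P.ravel())
    h1, h2 = H(P.sum(axis=1)), H(P.sum(axis=0))
    return h12 - h2, h12 - h1, h12

def union_bounds(n, R1, R2, pmfs, eps):
    """Upper bounds (capped at 1) on the probabilities of error events 1-3."""
    k1, k2 = np.asarray(pmfs[0]).shape
    d_n = (k1 + k2) * math.log2(n + 1) / n
    worst = np.max([entropies(P) for P in pmfs], axis=0)  # worst source in the class
    rates = np.array([R1, R2, R1 + R2])
    return np.minimum(1.0, 2.0 ** (-n * (rates - worst - eps - d_n)))
```

The bounds decay exponentially in $n$ once each rate exceeds the corresponding worst-case entropy by more than $\epsilon + \delta_n$, and are vacuous otherwise.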


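Since $\mathcal{P}_n$ is dense in $\Lambda$, as $n \to \infty$ we have decoding errors of the types above if

$$R_1 < \sup_{\mu \in \Lambda} H_\mu(S_1 | S_2)$$

$$R_2 < \sup_{\mu \in \Lambda} H_\mu(S_2 | S_1)$$

$$R_1 + R_2 < \sup_{\mu \in \Lambda} H_\mu(S_1, S_2).$$

This establishes the result. □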


The theorem says that there exists a Slepian-Wolf code that achieves all rates common to the
rate regions of all sources in the class. In some cases the class may be small and all sources have the
same rate region, so the rates are tight. As an example, consider a memoryless correlated source
$(S_1, S_2)$, with each component taking values in the set $\{-1, 1\}$ with distribution Bernoulli(1/2).
However, their product $S_1 S_2$ may have distribution Bernoulli($\alpha$) or Bernoulli($1 - \alpha$) for some fixed
$\alpha$. Call these two joint distributions $P_\alpha$ and $P_\beta$ respectively. Note that $H_b(\alpha) = H_b(1 - \alpha)$.
The Slepian-Wolf region is the set of rates $(R_1, R_2)$ such that

$$R_1 \geq H_b(\alpha) \tag{3.14}$$

$$R_2 \geq H_b(\alpha) \tag{3.15}$$

$$R_1 + R_2 \geq 1 + H_b(\alpha), \tag{3.16}$$

where $H_b(\cdot)$ denotes the binary entropy function.
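The region in this example is easy to evaluate. The sketch below computes the binary entropy in bits and tests whether a rate pair satisfies the three constraints; the function names are ours.

```python
import math

def binary_entropy(a):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log2(a) - (1.0 - a) * math.log2(1.0 - a)

def in_region(r1, r2, alpha):
    """True if (r1, r2) satisfies the three rate constraints for parameter alpha."""
    h = binary_entropy(alpha)
    return r1 >= h and r2 >= h and r1 + r2 >= 1.0 + h
```

Because the binary entropy is symmetric about 1/2, the same test applies to both distributions in the class, which is why a single code serves both.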

For a correlated source with many components, a result similar to Proposition 5 can be shown.
We omit the proof as it is nearly identical to that shown above.

Proposition 6. Let $\{a_\sigma : \sigma \subseteq \{1, 2, \ldots, m\}\}$ be a set of positive real numbers. Let $S(\sigma) = \{S_j : j \in \sigma\}$.
Consider the class of sources $\{P_\lambda : \lambda \in \Lambda\}$ for the random variables $(S_1, S_2, \ldots, S_m)$ that
satisfy:

$$a_\sigma \geq H(S(\sigma) | S(\sigma^c)). \tag{3.17}$$

Then the set of achievable rates is given by $\{(R_1, R_2, \ldots, R_m) : R(\sigma) \geq a_\sigma \ \forall \sigma\}$, where

$$R(\sigma) = \sum_{j \in \sigma} R_j. \tag{3.18}$$

Figure 3.2. Binary correlated source with one remote component. Conditioned on the first component,
the decoder can figure out if the second one was complemented or not.

3.2  Multi-terminal  coding  with  many  terminals 

Consider a slight variation of the previous example, as shown in Figure 3.2. The source $(S_1, S_2)$
has the same marginals as before, but the product $S_1 S_2$ is Bernoulli($\alpha$). The source $S_1$ is viewed
directly by the first terminal, but the second source $S_2$ is viewed remotely through a channel that may
or may not flip all of the bits, forming the observed sequence $T_2$. This induces the same class of
sources on $(S_1, T_2)$ as considered above. Using the same coding scheme, we can reconstruct $S_1$ and
$T_2$ exactly. From the empirical statistics of a long block-length sample we can recover the single bit
that determines the particular distribution in the class. This "extra bit" corresponds to the $\delta_n$ term
in the proof above. There, approximately $\log n$ bits of information about the joint distribution of
the source are transmitted in addition to the source sequences.

We now want to look at what happens as the number of sources gets larger. If there are
$m$ sources, the previous scheme would use $m \log n$ bits about the joint distribution. One way of
thinking about the blocklength $n$ is as a processing delay. If the number of sources is allowed to
grow with the processing delay, so that $m$ and $n$ are going to $\infty$ together, the penalty may become
non-negligible. Suppose that $|\mathcal{S}_j| = K$ for all $j$. For any fixed number $m$ of sources, the number
of types of denominator $n$ is upper bounded by $(n + 1)^{mK}$. As $m$ gets larger, the number of types
grows exponentially with $m$, which causes a larger rate penalty using the Slepian-Wolf system.
Increasing $m$ requires using a larger blocklength to achieve the same error probability. We would
like to capture this intuition. To do this, we will fix the ratio $(m \log n)/n$ and then consider a
sequence of $(n(m), R_1, \ldots, R_m)$ codes.

Definition 4. A rate sequence $\{R_i\}_{i=1}^{\infty}$ is achievable under scaling $f(n)$ if there exists a sequence
of $(n, R_1, \ldots, R_{f(n)})$ source codes whose probability of error $\epsilon_n \to 0$ as $n \to \infty$.

This  definition  forces  a  tradeoff  between  the  number  of  sensors  and  the  blocklength  of  the 
corresponding  universal  code.  If  the  modeling  complexity  increases  exponentially  in  the  number 
of  terminals,  we  may  incur  a  rate  penalty.  However,  if  the  number  of  models  is  only  polynomial 
in  the  terminals,  the  exponential  error  probability  from  the  block  code  can  easily  encompass  the 
modeling  complexity  as  well.  Our  main  result  is  the  following. 

Proposition 7. Let $\{a_\sigma : \sigma \subseteq \{1, 2, \ldots, m\}\}$ be a set of positive real numbers. Let $S(\sigma) = \{S_j : j \in \sigma\}$.
Consider the class of sources $\{P_\lambda : \lambda \in \Lambda\}$ for the random variables $(S_1, S_2, \ldots, S_m)$ that
satisfy:

$$a_\sigma \geq H(S(\sigma) | S(\sigma^c)). \tag{3.19}$$

Then the sum-rate for achievable tuples under scaling $n/\log n$ is bounded:

$$\lim_{m \to \infty} \sum_{j=1}^{m} R_j \geq \lim_{m \to \infty} H(S_1, S_2, \ldots, S_m) + c_1. \tag{3.20}$$

Proof. We can simply look at the entropy of the source. For fixed $m$ and $n$, consider jointly
encoding the source $(S_1, S_2, \ldots, S_m)$. This requires $nH(S_1, S_2, \ldots, S_m)$ bits plus $H(\mathcal{P}_n \cap \Lambda)$. There
are $c_2 n^{m c_3}$ sources in $\mathcal{P}_n \cap \Lambda$, so we need $c_4 m \log n$ additional bits. Dividing by the block length $n$:

$$\lim_{m \to \infty} \sum_{j=1}^{m} R_j \geq \lim_{m \to \infty} \left[ H(S_1, S_2, \ldots, S_m) + c_4 \frac{m \log n}{n} \right]. \tag{3.21}$$

Note that if $m$ and $n$ grow at the same rate, then the rate penalty increases logarithmically in the
block length. In some sense we can view this as a density-delay tradeoff for source coding systems
of this type. For very large systems, the delay required to amortize the modeling uncertainty is
also large.
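The density-delay tradeoff can be made concrete by asking how long the block must be before the penalty term $c_4 (m \log n)/n$ drops below a target. In the sketch below the constant `c4`, the base-2 logarithm, and the restriction to power-of-two blocklengths are illustrative assumptions.

```python
import math

def rate_penalty(m, n, c4=1.0):
    """Modeling overhead c4 * m * log2(n) / n in bits per symbol."""
    return c4 * m * math.log2(n) / n

def min_blocklength(m, target, c4=1.0):
    """Smallest power-of-two blocklength whose penalty is at most `target`."""
    n = 2
    while rate_penalty(m, n, c4) > target:
        n *= 2
    return n
```

With `c4 = 1` and a target of 0.1 bits per symbol, 10 sensors need a blocklength of 1024 while 100 sensors need 16384, so the required delay grows faster than linearly in the number of sensors.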


3.3  The  example  revisited 

Although  we  only  discussed  lossless  coding  problems  in  this  chapter,  the  ideas  can  provide  some 
intuition  for  our  example.  In  the  case  of  distributed  compression,  the  problem  does  not  naively 
decompose  into  that  of  partitioning  the  sensors  into  two  groups  and  using  a  CEO  code  [28]  for  each 
group.  However,  a  little  more  thought  shows  that  the  only  information  needed  by  the  terminals  to 
do  the  encoding  is  the  number  of  sensors  observing  the  same  source  as  them.  However,  the  decoder 
must  know  which  source  is  observed  by  each  sensor  in  order  to  do  the  joint  decoding  required  by 
the  CEO  code. 

In the case where the matrix $A$ is known, the distortion can be bounded by

$$D \leq \frac{\sigma_S^2}{\frac{\sigma_S^2}{\sigma_W^2} B_1 \left(1 - \exp(-2R_1/B_1)\right) + 1} + \frac{\sigma_S^2}{\frac{\sigma_S^2}{\sigma_W^2} B_2 \left(1 - \exp(-2R_2/B_2)\right) + 1}, \tag{3.22}$$

where the rates $R_1$ and $R_2$ are the sum rates for the terminals observing $S_1$ and $S_2$ respectively.

How much information is needed to enable this performance? In order to determine which
source it is observing, each sensor can compare its own source sequence to a common "pilot"
sequence, perhaps observed by another sensor. This pilot sequence could be the signs of the first $p$
samples observed at the sender. By measuring the empirical correlation between the pilot and its
own signal, each sensor can decide with high probability to which group it belongs. In addition,
the sensors would need the numbers $B_1$ and $B_2$. In terms of bits, we have

$$p + \log B_1 + \log B_2 \leq p + 2 \log M \text{ bits}. \tag{3.23}$$

Since these bits must be shared by all the sensors, it would be sufficient to broadcast all this
information to them before transmission. This is precisely the intuition behind the scheme described
in the next chapter.
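The pilot idea can be sketched in a few lines. The code below correlates each sensor's first $p$ samples with a sign sequence and thresholds the result, and also counts the side-information bits as in (3.23). The threshold value, the use of the empirical mean as the correlation statistic, and the function names are our assumptions.

```python
import numpy as np

def classify_by_pilot(U, pilot, threshold):
    """U: (M, p) sensor samples; pilot: (p,) sign sequence.

    Returns a boolean array that is True for sensors whose empirical
    correlation with the pilot exceeds the threshold.
    """
    corr = (np.asarray(U) * np.asarray(pilot)).mean(axis=1)
    return corr > threshold

def side_info_bits(p, B1, B2):
    """Pilot bits plus the bits needed to describe the group sizes."""
    return p + np.log2(B1) + np.log2(B2)
```

For unit-variance independent Gaussian sources observed in unit-variance noise, a sensor in the pilot's group has expected correlation of roughly 0.56 while a sensor in the other group has expected correlation 0, so a threshold near 0.3 separates the groups reliably once $p$ is in the thousands.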
Chapter 4


Finale:  fading  observations  and 
alignment 


In this chapter we look at a joint source-channel coding problem with uncertainty in the observation
process. We call these models fading observation models in analogy to fading channels in
wireless communications. Much of this work has appeared in some form in [32], [33]. In a sense, this
chapter will deal with a part of the interface between the previous two problems. The communication
channel imposes constraints on our encoding strategies. In the language of Chapter 2, these
may be constraints on the covariance of the estimation matrix. In terms of the coding problems in
Chapter 3, there may be a rate constraint on our codes. However, the underlying question is this:
what should the sensors send in order to best help the estimator, given that they must share the
available communications resources?

At first glance, it would seem that minimizing the redundancy in the messages sent by the
sensors would provide the decoder with the most information for its estimate. More precisely, the
sensors would use an optimal CEO source code [28] followed by a capacity-achieving channel code.
The decoder could then decode the compressed source observations and use those to estimate the
source. The problem with this approach is that the multiple-access channel is a bottleneck for the
rate. If the communication power available grows linearly with the number of sensors, the capacity
grows only logarithmically. The end-to-end distortion for this strategy scales to zero like $1/\log M$
for a Gaussian source observed in Gaussian noise with an additive white Gaussian noise channel.

The other approach is to let the sensors collaborate in communicating their observations. The
uncoded transmission strategy adopted by Gastpar and Vetterli [17] is one such collaboration
method. However, all of the sensors must know the joint statistics of the observation process as
well as the communication channel. In the case where there is observation uncertainty, the problem
of uncoded transmission may be significantly more difficult, as we have seen in the case of centralized estimators.

In this chapter we consider the slow-fading models examined at the end of Chapter 2. Although
we cannot prove that there is a strict penalty in the CEO code, the uncoded transmission scheme
fails completely in the presence of fading. We exhibit a simple feedback protocol that can bootstrap
the uncoded transmission scheme into a regime with a distortion that scales like $M^{-1/3}$, better than
that of separate source and channel coding without fading. We also conjecture that the CEO code
with similar feedback still exhibits the same $1/\log M$ scaling.

4.1  Uncertainty  in  observations 

We  first  describe  a  general  class  of  sensor  observation  models  and  then  a  specific  example  that 
will  dominate  the  bulk  of  the  analysis  in  this  chapter. 

4.1.1  Fading  observations  :  a  general  model 

The model we propose is similar to that studied at the end of Chapter 2. For clarity, we review
some notation. We use an uppercase letter $S$ for a random variable, a lowercase $s$ for its realization,
and $P_S(s)$ for its distribution. We write an independent, identically distributed (iid) sequence of
random variables indexed by $n$ as $\{S[n]\}_{n=1}^{\infty}$ or $S^n$. If $S[n]$ is vector valued, we denote the $m$-th
component of $S[n]$ by $S_m[n]$.

Figure 4.1. Sensor network with fading observations. The function $A$ can be arbitrary, but we will
generally assume that it is a linear transformation.

A source generates a sequence of iid symbols

$$S^n = \{S[n]\}_{n=1}^{\infty} \tag{4.1}$$

at each time $n$ according to $P_S(s)$. The $M$ sensors observe the source through the observation
function $A(\cdot)$, corrupted by noise $W$ that is iid across sensors and time with distribution $P_W(w)$:

$$U^n = A(S^n) + W^n. \tag{4.2}$$

The observation function $A(\cdot)$ is a random variable taking values in a set of functions $\mathcal{A}$ according
to some known distribution $P_A(a)$. The choice of $\mathcal{A}$ depends on the specifics of the sensors' design
and reflects the relationship between the quantities of interest and the actual observed variables.
We do not assume the sensors know the function $A(\cdot)$ after it is chosen. The fast-fading situation
with centralized estimation was dealt with in Chapter 2, and we do not address it in the context
of joint source-channel coding; instead we will focus on the slow-fading case.
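A minimal simulation of the slow-fading observation model (4.2) is sketched below for the linear case. It assumes a finite set of candidate observation matrices with a known prior and Gaussian noise; one matrix is drawn once and then held fixed for the whole block. The function names and the finite-set assumption are ours.

```python
import numpy as np

def draw_observation_matrix(A_set, probs, rng):
    """Draw one realization of A from a finite set according to the prior."""
    return A_set[rng.choice(len(A_set), p=probs)]

def observe(S, A, var_w, rng):
    """Sensor observations U[n] = A S[n] + W[n] for a fixed realization A.

    S has shape (L, n) and A has shape (M, L); the noise is iid Gaussian
    across sensors and time with variance var_w.
    """
    M, n = A.shape[0], S.shape[1]
    W = rng.normal(scale=np.sqrt(var_w), size=(M, n))
    return A @ S + W
```

Setting `var_w` to zero recovers the noiseless observations `A @ S`, which is a convenient sanity check on the dimensions.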

The model generalizes others in the literature. We can view the source as being observed
through an input channel as in [11], where the known channel is replaced by a fading channel.
If $A(\cdot)$ is the identity map with probability one, we reduce to the CEO problem [37]. If $S$
is a jointly Gaussian vector and $A(\cdot)$ is a random matrix, we have the source model studied by
Viswanath [36]. If $A$ is deterministic and the communication channel is also modeled by matrix
multiplication, we have the network studied in [18]. Typical channel fading models model the
fading as multiplicative because of experimental evidence on multipath interference. While the
main example we study in this chapter is multiplicative, it is for reasons of expediency rather
than the appropriateness of the model. As usual, we will treat sources and noises that are jointly
Gaussian.

In this network, the communication channel is added into the picture. We model it as a general
multiple-access channel with transition probabilities $p(y \mid x_1, x_2, \ldots, x_M)$. The sensors encode their
observations independently into channel inputs $X^n$. The specific realization of A is not known to
the encoders, but the prior distribution $P_A(a)$ is known. The communication channel may have a
cost function $\rho$ which the inputs must satisfy with some constraint P in the following sense:

$$E[\rho(X[n])] \le P \quad \forall n . \qquad (4.3)$$

The receiver uses Y to form an estimate $\hat{S}^n = \{\hat{S}[n]\}_{n=1}^{\infty}$ that minimizes a distortion measure
$d(S^n, \hat{S}^n)$. In general, the decoder also does not know the realization a of A, so the distortion will
depend on this realization. We write the end-to-end distortion D(A, M, P) that is achieved as

$$D(A, M, P) = E_S\left[d(S^n, \hat{S}^n)\right] . \qquad (4.4)$$

The goal of our coding scheme is to minimize the expected value of D.

In order to express our scaling results, we use standard asymptotic notation. We say that a
function f(M) scales as fast as g(M), or $f(M) = \Omega(g(M))$, if there exists a constant $c_l > 0$ such that,
for all sufficiently large M,

$$f(M) \ge c_l\, g(M) . \qquad (4.5)$$

We say that a function f(M) scales as slow as g(M), or $f(M) = O(g(M))$, if there exists a constant
$c_u$ such that, for all sufficiently large M,

$$f(M) \le c_u\, g(M) . \qquad (4.6)$$

We write $f = O(\max\{g_1(M), g_2(M)\})$ if there is a constant $c_u$ such that

$$f(M) \le c_u \max\{g_1(M), g_2(M)\} . \qquad (4.7)$$




Figure  4.2.  Gaussian  network  with  fading  observations. 

By limiting ourselves to these asymptotic characterizations, we conveniently ignore many factors
which affect smaller networks. However, we feel that this analysis is useful in characterizing
achievable performance limits.

4.1.2  Scalar  multiplicative  fading 

The first question to ask for sensors with faded observations is to what extent they can
determine the fading process. Consider the case where each sensor receives $A_m(S[n]) + W_m[n]$,
where the $A_m$ are drawn iid from some distribution over a finite set $\mathcal{A}$. The sensors can estimate
their own marginal distribution locally. If two fading functions $a_1, a_2 \in \mathcal{A}$ induce different marginal
distributions on $U_m$ at sensor m, then the sensor can in theory discriminate between them based
on the empirical statistics. The difficulty introduced by fading observations in this setting arises when
different fading functions induce the same marginal distribution on $U_m$. For point-to-point
systems, this is similar to the rate-distortion problem considered by [9]. In order to collaborate, the
sensors must disambiguate between the fading processes which could induce their local distribution.
We call this problem one of alignment.

Again, we turn to Gaussian sources and Gaussian noise. The model we consider is shown
in Figure 4.2. The source S is a Gaussian random variable with mean 0 and variance $\sigma_S^2$. The
observation noises $\{W_m\}_{m=1}^{M}$ are iid zero-mean Gaussian random variables with variance $\sigma_W^2$. The
multiple-access channel is a standard memoryless additive white Gaussian noise channel with a
sum-power constraint $\sum_{m=1}^{M} E[|X_m|^2] \le MP$ on the inputs. The channel noise Z is Gaussian with
zero mean and variance $\sigma_Z^2$. The distortion measure is mean-squared error:

$$d(S^n, \hat{S}^n) = \lim_{N \to \infty} \frac{1}{N} E\left[\sum_{n=1}^{N} \left|S[n] - \hat{S}[n]\right|^2\right] . \qquad (4.8)$$

Consider a Gaussian network in which the observation fading functions $\{A_m(\cdot)\}$ are
multiplication by scalars $\{A_m\}$ satisfying the following:

$$\{A_m\} \text{ are iid with distribution } p_A(a) \qquad (4.9)$$

$$p_A(a) = p_A(-a) \qquad (4.10)$$

$$|A_m| \le \nu \ \text{ a.s.} \qquad (4.11)$$

We call this bounded real scalar fading. A specific example that we will use is when the $\{A_m\}$ are
iid and equiprobable on the set $\{-1, +1\}$. In the next section we will examine the performance of
coding strategies on this kind of source fading.
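
The alignment problem for scalar fading can be illustrated with a small numerical sketch. It assumes that a sensor knows the source and noise variances (the parameter names `var_s` and `var_w` and the helper functions are hypothetical, introduced only for this illustration) and estimates the gain magnitude from the empirical second moment of its observations. The estimate is the same for gains +a and -a, which is exactly the ambiguity the sensors cannot resolve locally.

```python
import math
import random

def estimate_gain_magnitude(observations, var_s, var_w):
    """Estimate |A_m| from the empirical second moment of U_m = A_m*S + W_m."""
    n = len(observations)
    second_moment = sum(u * u for u in observations) / n
    # E[U_m^2] = A_m^2 * var_s + var_w, so solve for A_m^2 and clip at zero.
    gain_sq = max(0.0, (second_moment - var_w) / var_s)
    return math.sqrt(gain_sq)

def simulate_observations(gain, n, var_s, var_w, rng):
    """Draw n observations U = gain*S + W with independent Gaussian S and W."""
    sd_s, sd_w = math.sqrt(var_s), math.sqrt(var_w)
    return [gain * rng.gauss(0.0, sd_s) + rng.gauss(0.0, sd_w) for _ in range(n)]

rng = random.Random(0)
est_pos = estimate_gain_magnitude(simulate_observations(+0.8, 20000, 1.0, 1.0, rng), 1.0, 1.0)
est_neg = estimate_gain_magnitude(simulate_observations(-0.8, 20000, 1.0, 1.0, rng), 1.0, 1.0)
# Both estimates are close to 0.8: the magnitude is identifiable, the sign is not.
print(est_pos, est_neg)
```

Both runs recover a magnitude near 0.8 regardless of the sign of the gain, since the marginal distribution of the observations depends on the gain only through its square.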

4.2  Existing  schemes 

The two schemes we examine for this joint source-channel coding problem are uncoded
transmission and separation-based coding. Both of these schemes have already been analyzed in the
absence of fading [17]. In the separation-based approach, the problem is decomposed into that
of distributedly compressing the source into independent bitstreams and then encoding those
bitstreams over the multiple-access channel. On the other hand, the sensors could simply forward
their observations and use the fact that the channel's summing operation partially computes the
MMSE estimate; this is what we call uncoded transmission.

4.2.1 Separate source and channel coding


In this section we derive the asymptotic performance for separation-based compression of
sources with distributed encoding. Let $R_S(D)$ be the rate-distortion function for the source S
with distortion limit D. Note that $R_S(D)$ is vector-valued and represents a rate tuple $(R_1, \ldots, R_M)$.
Let $C(P)$ be the capacity function for the multiple-access channel $p(y \mid x_1, \ldots, x_M)$ under the cost
constraint $E[\rho(x_1, \ldots, x_M)] \le P$. Note that $C(P)$ is also a tuple of achievable rates $(R_1, \ldots, R_M)$
for reliable communication across the multiple-access channel.

Suppose we have a distributed source code for S with rates $r = (r_1, \ldots, r_M)$. If $R_S(D) = r$ and
$r \le C(P)$ then we can compress the source and transmit the compressed messages reliably across
the channel. Thus if $R_S(D) \le C(P)$ component-wise, we can achieve distortion D across the
channel. If $r > C(P)$ in any component, then the rate tuple generated by the source code cannot
be communicated reliably across the channel.

Let $R_{tot} = \sum_{m=1}^{M} R_m$. If all of the signs $\{A_m\}$ are known, the distortion achievable at sum-rate
$R_{tot}$ using a CEO source code in the limit as $M \to \infty$ is given by [28, Equation (6)]:

$$D(R_{tot}, M) = \frac{\sigma_S^2}{\frac{\sigma_S^2}{\sigma_W^2} M \left(1 - \exp(-2R_{tot}/M)\right) + 1} . \qquad (4.12)$$

The sum capacity of the Gaussian multiple-access channel is upper-bounded by the case when the
messages may be dependent. The total power is MP, so:

$$R_{tot} \le \frac{1}{2} \log\left(1 + \frac{M^2 P}{\sigma_Z^2}\right) . \qquad (4.13)$$

Substituting, the achievable distortion is lower bounded by

$$D(M) \ge \frac{\sigma_S^2 \sigma_W^2}{\sigma_W^2 + \sigma_S^2 M \left(1 - \left(\frac{\sigma_Z^2}{\sigma_Z^2 + M^2 P}\right)^{1/M}\right)} . \qquad (4.14)$$

Taking the limit as $M \to \infty$ gives the $1/\log M$ behavior described at the beginning of this chapter.
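
The logarithmic decay can be checked numerically. The following is a minimal sketch that evaluates the separation-based lower bound obtained by substituting the sum-capacity bound into the CEO distortion expression, assuming natural logarithms and unit variances and power in the example run (the function and parameter names are hypothetical). For large M the product of the bound with $2 \log M$ approaches $\sigma_W^2$.

```python
import math

def separation_distortion_bound(M, P, var_s, var_w, var_z):
    """Lower bound on distortion for separation-based coding with M sensors."""
    # Upper bound on the sum rate over the Gaussian MAC, in nats.
    r_tot = 0.5 * math.log1p(M * M * P / var_z)
    # 1 - exp(-2 R_tot / M), computed stably when the exponent is tiny.
    rate_factor = -math.expm1(-2.0 * r_tot / M)
    return var_s / ((var_s / var_w) * M * rate_factor + 1.0)

for M in (10, 1000, 10**6):
    d = separation_distortion_bound(M, P=1.0, var_s=1.0, var_w=1.0, var_z=1.0)
    # The last column tends to var_w = 1 as M grows, i.e. D ~ var_w / (2 log M).
    print(M, d, d * 2.0 * math.log(M))
```

The slow decay visible in the output is the reason separation-based coding is used here only as a baseline.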

4.2.2 Uncoded transmission


In the uncoded transmission scheme, each sensor scales its own observation to meet the power
constraint of the channel. Define the constant $\eta$ as

$$\eta = \sqrt{\frac{P}{\sigma_S^2 + \sigma_W^2}} . \qquad (4.15)$$

Then

$$X_m[n] = \eta U_m[n] = \eta (A_m S[n] + W_m[n]) . \qquad (4.16)$$

The received signal is then

$$Y[n] = \eta \left(\sum_{m=1}^{M} A_m S[n] + \sum_{m=1}^{M} W_m[n]\right) + Z[n] . \qquad (4.17)$$


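The MMSE estimate of S given Y will depend on the random variable $B = \sum_{m=1}^{M} A_m$. For a
fixed B define

$$L(B, M) = \frac{\sigma_S^2}{\dfrac{B^2}{M + (\sigma_Z^2/\sigma_W^2)\,\eta^{-2}} \cdot \dfrac{\sigma_S^2}{\sigma_W^2} + 1} . \qquad (4.18)$$

The expected distortion is then

$$D_{unc}(M, P) = E_B[L(B, M)] . \qquad (4.19)$$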


If all the gains $A_m$ are equal to 1, then B = M and the function $L(B, M) = O(M^{-1})$. A more
interesting case is when we have bounded real scalar fading:

Proposition 8. For the Gaussian network with fading observations satisfying (4.9)-(4.11), the
distortion scales like $\Omega(1)$ using the uncoded transmission scheme.


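Proof. Note that the sensors can each determine the magnitude of their fading coefficients $\{A_m\}$
by computing the marginal density function of their observations. For sensor m, the density of $U_m$
given $A_m = a_m$ is

$$p_{U_m}(u_m) = \frac{1}{\sqrt{2\pi(a_m^2\sigma_S^2 + \sigma_W^2)}} \exp\left(-\frac{u_m^2}{2(a_m^2\sigma_S^2 + \sigma_W^2)}\right) . \qquad (4.20)$$

This density is identical for $A_m = \pm a_m$. The sensors apply the gains

$$\eta_m = \sqrt{\frac{P}{A_m^2\sigma_S^2 + \sigma_W^2}}$$

to get the distortion in (4.19). Note that

$$\sigma_W^2 \sum_{m=1}^{M} \frac{P}{\sigma_S^2 A_m^2 + \sigma_W^2} + \sigma_Z^2 \ge M \frac{\sigma_W^2 P}{\sigma_S^2 \nu^2 + \sigma_W^2} + \sigma_Z^2 \ge M\mu \qquad (4.21)$$

for some $\mu > 0$.

Thus we can lower bound the distortion by

$$D(M) \ge \frac{\sigma_S^2}{\sigma_S^2 B^2/(M\mu) + 1} . \qquad (4.22)$$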


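Now note that $B = \sum_{m=1}^{M} \eta_m A_m$, and

$$E[\eta_m A_m] = E\left[\sqrt{\frac{P A_m^2}{A_m^2\sigma_S^2 + \sigma_W^2}} \cdot \mathrm{sgn}\, A_m\right] \qquad (4.23)$$

$$= 0 \qquad (4.24)$$

$$E[\eta_m^2 A_m^2] = E\left[\frac{P A_m^2}{A_m^2\sigma_S^2 + \sigma_W^2}\right] = \sigma_B^2 < \infty . \qquad (4.25)$$

So by the central limit theorem [14], $M^{-1/2} B$ converges in distribution to a Gaussian random variable with mean
0 and variance $\sigma_B^2$.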


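We can now write the expected distortion in the limit:

$$\lim_{M\to\infty} E[D(M)] \ge \lim_{M\to\infty} E\left[\frac{\sigma_S^2}{(B M^{-1/2} \sigma_S/\sqrt{\mu})^2 + 1}\right] \qquad (4.26)$$

$$= E\left[\lim_{M\to\infty} \frac{\sigma_S^2}{(B M^{-1/2} \sigma_S/\sqrt{\mu})^2 + 1}\right] \qquad (4.27)$$

$$= E\left[\frac{\sigma_S^2}{\xi^2 + 1}\right] \qquad (4.28)$$

$$> 0 , \qquad (4.29)$$

where $\xi$ is normally distributed with mean 0 and variance $\sigma_S^2\sigma_B^2/\mu$. Thus the expected distortion does
not converge to 0 as $M \to \infty$, so $D(M) = \Omega(1)$.

□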
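
The content of Proposition 8 can be illustrated with a short Monte Carlo sketch for the equiprobable sign-fading example. The distortion function below is the standard linear-Gaussian MMSE for a received signal of the form $\eta(BS + \sum_m W_m) + Z$ under the assumption that the decoder knows B; the function names, the unit parameter values and the trial counts are assumptions made for this illustration only.

```python
import random

def mmse_distortion(B, M, P, var_s, var_w, var_z):
    """MMSE of S from Y = eta*(B*S + sum of M noises) + Z when B is known."""
    eta_sq = P / (var_s + var_w)
    signal = eta_sq * B * B * var_s
    noise = eta_sq * M * var_w + var_z
    return var_s * noise / (signal + noise)

def expected_distortion_sign_fading(M, trials, rng, P=1.0, var_s=1.0, var_w=1.0, var_z=1.0):
    """Average the distortion over random equiprobable signs A_m in {-1, +1}."""
    total = 0.0
    for _ in range(trials):
        B = sum(rng.choice((-1, 1)) for _ in range(M))
        total += mmse_distortion(B, M, P, var_s, var_w, var_z)
    return total / trials

rng = random.Random(1)
for M in (10, 100, 1000):
    faded = expected_distortion_sign_fading(M, 2000, rng)
    aligned = mmse_distortion(M, M, 1.0, 1.0, 1.0, 1.0)
    # The aligned column decays roughly like 1/M; the faded column does not decay.
    print(M, round(faded, 3), round(aligned, 5))
```

With unknown signs the sum B grows only like the square root of M, so the coherent gain is lost and the average distortion stays bounded away from zero.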


We  can  also  give  a  result  that  is  slightly  stronger  only  on  technical  grounds. 

Proposition 9. Suppose the fading observation functions in the Gaussian model are multiplication
by random variables $\{A_m\}$ taking values in $\{-1, 1\}$. Suppose the $\{A_m\}$ are exchangeable and

$$\lim_{M\to\infty} M^{-1} \sum_{m=1}^{M} A_m = 0 \ \text{ a.s.} \qquad (4.30)$$

$$\lim_{M\to\infty} P\left(\left|\frac{1}{\sqrt{M}} \sum_{m=1}^{M} A_m\right| > \epsilon\right) \le K_\epsilon < 1 . \qquad (4.31)$$

Then uncoded transmission yields an expected distortion that scales with M like $\Omega(1)$.

Proof. Let H be the event $(|B| > \sqrt{M}\epsilon)$. In this case,

$$P(H^c) \ge 1 - K_\epsilon > 0 . \qquad (4.32)$$

We can find a lower bound on the distortion by only taking the expectation over $H^c$:

$$D_{unc}(M, P) \ge E\left[L(B, M)\,\mathbf{1}_{H^c}\right] \ge L(\sqrt{M}\epsilon, M)\,(1 - K_\epsilon) .$$

As $M \to \infty$, this function converges to a positive constant, so $D_{unc}(M) = \Omega(1)$. □

4.3  A  simple  feedback  framework 

The problem with the uncoded transmission scheme is that the sensors' observations are not
aligned, so blindly forwarding their observations causes sufficient interference to void the collaborative
gain from the coherent addition of the source observations. Another way of viewing this is that any
choice of gains is a choice of estimator, and that every estimator performs poorly in expectation
over the distribution of A. This lack of alignment is caused by the zero-mean distribution we chose
for the observation gains. We now present a scheme that uses a small amount of extra information
to recover some of the performance of uncoded transmission in the unknown-sign model. We
show that with a single bit broadcast to all the sensors we can make the distortion converge to 0
as $O(M^{-1/3})$, and for K bits we get $O(M^{-K/(K+2)})$. We generally refer to the extra information
used as feedback, although we emphasize that it need not come from the end receiver in the sensor
network. Indeed, an interesting case is when the feedback comes from one of the other sensors in
the network.

The method we propose to align the sensors is to divide the operation of the network into two
phases: a discovery phase and a transmission phase. We define the time axis so that the discovery
phase is at times $n \le 0$ and the transmission phase is at times $n > 0$. In the discovery phase sensor
m forms an estimate $\hat{A}_m$ of its own observation gain $A_m$ based on its own observations and some
feedback. Sensor m then forwards $\eta \hat{A}_m U_m$ to compensate for its observation gain. If a sufficient
fraction of the sensors are aligned, the distortion will converge to 0 rapidly as the number of sensors
increases.

4.3.1  A  single  bit  of  feedback  for  sign  fading 

In this section we consider a discovery phase of only one time step. Let $S_0 = S[0]$ be the
value of the source during the discovery phase on which we base our feedback signal. Let $f(S_0)$
be a feedback signal that is broadcast to all of the sensors. The function $f(\cdot)$ may be stochastic;
for example, it may be a noisy observation of the source. We give examples of specific forms of
feedback and decision rules in the next section.

After receiving $f(S_0)$, each sensor m forms an estimate $\hat{A}_m = g_m(f(S_0), U_m)$ of its observation
gain. Conditioned on a value $S_0 = s_0$, the sensor observations are independent and identically
distributed, so the events of successful estimation of the observation gains are independent from
sensor to sensor. By the exchangeability of the gains $A_m$, each sensor should attempt to maximize
its probability of success, and all sensors will adopt the same decision rule $g_m(\cdot) = g(\cdot)$, so that
$\hat{A}_m = g(f(S_0), U_m)$. Let $v(s_0)$ denote the probability that a sensor correctly estimates its own
observation gain. Upon making their decisions, the sensors then form the channel inputs

$$X_m[n] = \eta \hat{A}_m U_m[n] . \qquad (4.33)$$

Let T denote the number of sensors for which $\hat{A}_m = A_m$. Conditioned on the value of $S_0$, the event
of each sensor being aligned correctly depends only on the noise value at that sensor, and hence is
an independent Bernoulli random variable with parameter $v(s_0)$, so T is a binomial random variable
with parameters $(M, v(s_0))$.


The distortion achievable after the feedback is given by

$$E[D \mid f(S_0)] = E_{S_0}\left[\sum_{k=0}^{M} L(M - 2k, M)\, P(T = k)\right] , \qquad (4.34)$$

where the expectation is taken over $S_0$ and

$$P(T = k) = \binom{M}{k} v(s_0)^k (1 - v(s_0))^{M-k} . \qquad (4.35)$$


For equation (4.34) to converge to 0, each term in the summation must converge to 0 as $M \to \infty$.
The convergence is in turn dependent on $S_0$ and the decision rule $g(\cdot)$ by which the sensors estimate
their alignment. Intuitively, if $S_0$ is close to zero, it will be difficult to determine $A_m$ and thus the
probability $v(S_0)$ that sensor m aligns correctly will be close to 1/2. We therefore wish to find
a decision rule $g(\cdot)$ that minimizes the probability of error for each sensor, or, alternately, that
maximizes the probability of correct alignment.
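
To see how strongly the distortion depends on the per-sensor success probability, the binomial average over the number of aligned sensors can be evaluated directly for a fixed value of that probability. This is a minimal sketch: the distortion function is the linear-Gaussian MMSE for a known coherent sum, and the unit default parameters and function names are assumptions made for the illustration.

```python
import math

def mmse_distortion(B, M, P=1.0, var_s=1.0, var_w=1.0, var_z=1.0):
    """MMSE of S when the coherent sum of aligned gains equals B."""
    eta_sq = P / (var_s + var_w)
    noise = eta_sq * M * var_w + var_z
    return var_s * noise / (eta_sq * B * B * var_s + noise)

def binomial_pmf(M, k, v):
    """P(T = k) for T ~ Binomial(M, v), evaluated in log space."""
    if v <= 0.0:
        return 1.0 if k == 0 else 0.0
    if v >= 1.0:
        return 1.0 if k == M else 0.0
    log_p = (math.lgamma(M + 1) - math.lgamma(k + 1) - math.lgamma(M - k + 1)
             + k * math.log(v) + (M - k) * math.log(1.0 - v))
    return math.exp(log_p)

def expected_distortion_given_v(M, v):
    """Average distortion over T ~ Binomial(M, v) aligned sensors."""
    return sum(mmse_distortion(M - 2 * k, M) * binomial_pmf(M, k, v)
               for k in range(M + 1))

for v in (0.5, 0.6, 0.9):
    print(v, expected_distortion_given_v(200, v))
```

A success probability of exactly 1/2 leaves the distortion at a constant level, while any fixed bias above 1/2 makes it small once M is moderately large.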

In order for $D(M) = O(M^{-1})$, the fraction T/M of correctly aligned sensors must be bounded away from 1/2 with
probability one over the possible values of $S_0$. Assume that $\lim_{M\to\infty} T/M > 1/2 + \epsilon$ with probability
one. By the strong law of large numbers, $T/M \to v(S_0)$ with probability one, which implies
$v(S_0) > 1/2 + \epsilon$ with probability one, which will not be true in general. However, if we allow $\epsilon$ to
scale with M as well, we can recover some of the scaling rate, as shown in the next proposition.

Proposition 10. Suppose the feedback function $f(\cdot)$ and decision rule $g(\cdot)$ are chosen such that
there exist constants $\epsilon_1 > 0$, $\epsilon_2 > 0$, and functions $\alpha(M)$ and $\beta(M)$ such that

$$\lim_{M\to\infty} \alpha(M) = 0 \qquad (4.36)$$

$$\lim_{M\to\infty} \beta(M) = 0 \qquad (4.37)$$

$$\lim_{M\to\infty} M\beta(M)^2 = \infty \qquad (4.38)$$

$$(|S_0| > \epsilon_1\alpha(M)) \implies v(S_0) \ge \frac{1}{2} + \epsilon_2\beta(M) . \qquad (4.39)$$

Then as $M \to \infty$, the feedback/decision pair (f, g) achieves a distortion D(M) in the transmission
phase satisfying

$$D(M) = O\left(\max\left\{\alpha(M), \frac{1}{M\beta(M)^2}\right\}\right) \qquad (4.40)$$

for the network with sign fading.

Remark. Equation (4.39) says that if the source sample $S_0$ on which we base our feedback is
large enough, i.e. at least $\epsilon_1\alpha(M)$ in magnitude, then the probability of successful alignment is at least $\epsilon_2\beta(M)$
more than 1/2.


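Proof. Suppose that (4.36)-(4.39) hold. Let H be the event $(|S_0| > \epsilon_1\alpha(M))$. Let T be a binomial
random variable with parameters $(M, v(s_0))$ so that $E[T] = Mv(s_0)$. We have the following bound:

$$P\left(|T - E[T]| > \frac{\epsilon_2}{2} M\beta(M)\right) \le 2\exp\left(-\frac{2}{M}\left(\frac{\epsilon_2}{2} M\beta(M)\right)^2\right) \qquad (4.41)$$

$$= 2\exp\left(-\frac{1}{2}\epsilon_2^2 M\beta(M)^2\right) . \qquad (4.42)$$

We can also write the following bounds, using the assumptions in (4.39):

$$P(H^c) = \int_{-\epsilon_1\alpha(M)}^{\epsilon_1\alpha(M)} \frac{1}{\sqrt{2\pi\sigma_S^2}}\, e^{-x^2/(2\sigma_S^2)}\,dx \qquad (4.43)$$

$$\le \sqrt{\frac{2}{\pi\sigma_S^2}}\,\epsilon_1\alpha(M) \qquad (4.44)$$

$$v(S_0 \mid H) \ge \frac{1}{2} + \epsilon_2\beta(M) . \qquad (4.45)$$

We evaluate the distortion separately on the events

$$F_1 = H^c$$

$$F_2 = H \cap \left\{|T - E[T]| > (\epsilon_2/2) M\beta(M)\right\}$$

$$F_3 = H \cap \left\{|T - E[T]| \le (\epsilon_2/2) M\beta(M)\right\} .$$

On $H^c$ we upper bound the distortion by letting T = M/2. On H, we have equation (4.45), so we
can let T = M/2 on $F_2$, while on $F_3$ we have $T \ge M/2 + (\epsilon_2/2)M\beta(M)$, so that $|M - 2T| \ge \epsilon_2 M\beta(M)/2$.
Since L(B, M) is decreasing in |B|, the distortion on $F_3$ is at most $L(M\epsilon_2\beta(M)/2, M)$. Putting this
together and noting that the distortion must be less than $\sigma_S^2$, we rewrite (4.34) as:

$$E_{S_0}\left[L(M - 2T, M)\left(\mathbf{1}_{F_1} + \mathbf{1}_{F_2} + \mathbf{1}_{F_3}\right)\right]$$

$$\le \sigma_S^2\left(P(H^c) + P(F_2 \mid H)\right) + L(M\epsilon_2\beta(M)/2, M)\,P(F_3 \mid H)$$

$$\le \sigma_S^2\left(\sqrt{\frac{2}{\pi\sigma_S^2}}\,\epsilon_1\alpha(M) + 2\exp\left(-\frac{1}{2}\epsilon_2^2 M\beta(M)^2\right)\right) + L(M\epsilon_2\beta(M)/2, M) .$$

The first term converges to zero with the slower of $\alpha(M)$ and $\exp(-\frac{1}{2}\epsilon_2^2 M\beta(M)^2)$. The second term
converges to zero as $O(M^{-1}\beta(M)^{-2})$. Since $M\beta(M)^2 \to \infty$, the exponential term decays faster than
$M^{-1}\beta(M)^{-2}$ and can be ignored, so the distortion is $O(\max\{\alpha(M), M^{-1}\beta(M)^{-2}\})$.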


The bounds for this proof depend only on the relationship between the values of $S_0$ and
the success probability $v(S_0)$. The latter depends only on the noise distribution, and thus it is
possible to treat non-Gaussian noises, although in this case the destination's linear estimator will
not necessarily be the MMSE estimator.
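
The concentration step in the proof of Proposition 10 can be checked numerically. The sketch below assumes the bound is of Hoeffding type, $P(|T - E[T]| > t) \le 2\exp(-2t^2/M)$ for a binomial count T over M sensors, and compares it with the exact binomial deviation probability. The choices $\epsilon_2 = 1$ and $\beta(M) = M^{-1/3}$, as well as the function names, are illustrative assumptions.

```python
import math

def binomial_pmf(M, k, v):
    """P(T = k) for T ~ Binomial(M, v) with 0 < v < 1, in log space."""
    log_p = (math.lgamma(M + 1) - math.lgamma(k + 1) - math.lgamma(M - k + 1)
             + k * math.log(v) + (M - k) * math.log(1.0 - v))
    return math.exp(log_p)

def deviation_probability(M, v, t):
    """Exact P(|T - M*v| > t) for T ~ Binomial(M, v)."""
    mean = M * v
    return sum(binomial_pmf(M, k, v) for k in range(M + 1) if abs(k - mean) > t)

def hoeffding_bound(M, t):
    """Two-sided Hoeffding bound 2*exp(-2 t^2 / M) for a sum of M Bernoulli trials."""
    return 2.0 * math.exp(-2.0 * t * t / M)

M = 400
beta = M ** (-1.0 / 3.0)      # illustrative choice of beta(M)
v = 0.5 + beta                # success probability with eps_2 = 1
t = 0.5 * M * beta            # deviation (eps_2 / 2) * M * beta(M)
exact = deviation_probability(M, v, t)
bound = hoeffding_bound(M, t)
print(exact, bound)
```

The exact probability sits well below the bound, which is consistent with the exponential term being negligible in the final scaling.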

Proposition 11. Suppose the fading observation functions satisfy the conditions of Proposition 9.
Then the K-bit feedback scheme outlined above achieves a distortion that scales like $O(M^{-K/(K+2)})$.


4.3.2  Example:  feedback  from  a  beacon  sensor 
Example:  perfect  feedback 

To gain further insight, let us consider an idealized case in which $f(\cdot)$ is the identity function, so
that the sensors get to know the source sample exactly; we call this perfect feedback. We emphasize
that this feedback is only available during the discovery phase and not for all time, and that we
are assuming that the discovery phase is one sample long.

Conditioned on the value of $S_0 = s_0$, each sensor is left with the problem of detecting antipodal
signals $\pm s_0$ in the presence of Gaussian noise with a uniform prior. The maximum a posteriori
probability (MAP) rule for this problem is a threshold test at 0; for $s_0 > 0$, if $U_m > 0$ then $\hat{A}_m = 1$, and
for $s_0 < 0$, if $U_m > 0$ then $\hat{A}_m = -1$. The probability of success for this rule is

$$v_p(s_0) = 1 - Q\left(\frac{|s_0|}{\sigma_W}\right) , \qquad (4.46)$$

where

$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-y^2/2}\,dy . \qquad (4.47)$$

The success probability, conditioned on the event $H = (|S_0| > \epsilon_M)$, satisfies, for sufficiently small $\epsilon_M$,

$$v_p(S_0 \mid H) \ge \frac{1}{2} + \frac{\epsilon_M}{2\sqrt{2\pi}\,\sigma_W} . \qquad (4.48)$$

Proposition 10 tells us that the distortion scales as $O(\max\{\epsilon_M, M^{-1}\epsilon_M^{-2}\})$. Equating these two
scaling rates, we can set $\epsilon_M = O(M^{-1/3})$ to yield a scaling rate of $O(M^{-1/3})$.

Suppose instead that the sensors do not receive $S_0$, but instead a one-bit quantization of $S_0$, or
$f(S_0) = \mathrm{sgn}(S_0)$. Since the MAP rule with full knowledge of $S_0$ was a threshold test at 0 for
all values of $S_0$, the MAP rule for this case is the same. Another way to phrase the decision rule
is: if $f(S_0) = \mathrm{sgn}(U_m)$ then $\hat{A}_m = 1$, otherwise $\hat{A}_m = -1$. We again condition on the value of $S_0$,
which gives the same success probabilities as (4.46) and (4.48). Therefore the distortion with
this one bit of feedback is also $O(M^{-1/3})$. It is interesting to note that in this case the achievable
distortion scaling does not depend on the "richness" of the actual available feedback.
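
The one-bit decision rule is simple enough to simulate. The sketch below draws equiprobable signs, applies the rule that sets the estimate to +1 when the feedback bit agrees with the sign of the local observation, and compares the empirical success rate with $1 - Q(|s_0|/\sigma_W)$. The helper names, the trial count and the values of the source sample and noise level are assumptions for the illustration.

```python
import math
import random

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def sign(x):
    return 1 if x >= 0 else -1

def sign_feedback_success_rate(s0, sd_w, trials, rng):
    """Fraction of trials in which the one-bit rule recovers the true sign A_m."""
    feedback = sign(s0)  # one-bit feedback f(S0) = sgn(S0)
    hits = 0
    for _ in range(trials):
        a = rng.choice((-1, 1))                 # unknown observation gain
        u = a * s0 + rng.gauss(0.0, sd_w)       # local observation U_m
        a_hat = 1 if feedback == sign(u) else -1
        hits += (a_hat == a)
    return hits / trials

rng = random.Random(3)
s0, sd_w = 0.5, 1.0
empirical = sign_feedback_success_rate(s0, sd_w, 20000, rng)
theory = 1.0 - q_function(abs(s0) / sd_w)
print(empirical, theory)
```

Only the sign of the source sample enters the rule, which is why a single bit performs as well as the exact sample in this case.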


Example:  one  bit  of  feedback 


Suppose instead that each sensor is given access to the sign of the signal received at an extra
sensor b:

$$f(S_0) = G_b = \mathrm{sgn}(A_b S_0 + W_b) . \qquad (4.49)$$

Sensor m must then decide if $A_m = A_b$ or $A_m \ne A_b$. Again, we condition on $S_0 = s_0$. The
probability that $G_b$ matches the sign of the noiseless beacon observation $A_b s_0$ is given by

$$P\left(G_b = \mathrm{sgn}(A_b s_0) \mid S_0 = s_0\right) = 1 - Q\left(\frac{|s_0|}{\sigma_W}\right) . \qquad (4.50)$$

Since we cannot exactly know the signs $A_b$ and $A_m$, we assume without loss of generality that
$A_b = 1$ and attempt to distinguish between the two hypotheses $A_m = A_b$ and $A_m = -A_b$.
Suppose we have full knowledge of $U_b = A_b S_0 + W_b$. Under the hypothesis $A_m = A_b$, the pair
$(A_b s_0 + W_b, A_m s_0 + W_m)$ is jointly Gaussian with mean $(s_0, s_0)$. Under the hypothesis $A_m = -A_b$,
they are jointly Gaussian with mean $(s_0, -s_0)$. The decision rule in this case is again a threshold
test on the line $U_m = 0$. If $U_m$ and $U_b$ have the same sign, then sensor m guesses $\hat{A}_m = A_b = 1$,
and if they have different signs it guesses $\hat{A}_m = -A_b = -1$. This rule is again indifferent to the
value of $S_0$, as in the perfect feedback case.


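The decision rule for $\hat{A}_m$ that maximizes the a posteriori probability given the observations is therefore
given by

$$\hat{A}_m = g(G_b, U_m) = \begin{cases} 1 & \text{if } G_b = \mathrm{sgn}(U_m) \\ -1 & \text{if } G_b \ne \mathrm{sgn}(U_m) , \end{cases} \qquad (4.51)$$

regardless of the value of $S_0$. The probability of success is given by

$$v_n(s_0) = 1 - 2Q\left(\frac{|s_0|}{\sigma_W}\right) Q\left(-\frac{|s_0|}{\sigma_W}\right) . \qquad (4.52)$$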


So  the  conditional  probability  of  success  is: 


|5ol>f)U  + 


+ 


2ttct 


(4.53) 


The $\epsilon_M$ term is dominant as $M \to \infty$ in this conditional probability, the same as in the perfect
feedback case. From our previous analysis, we can see that $D(M) = O(M^{-1/3})$.

The  noisy  feedback  example  models  a  situation  in  which  one  sensor  acts  as  a  “beacon”  by 
broadcasting  the  sign  of  its  observation  to  the  other  sensors.  In  a  scaling  sense,  the  sign  of  the 
noisy  observation  is  “as  good”  as  knowing  the  sign  of  the  source  sample  exactly,  although  the 
constants  in  the  convergence  become  worse  as  the  noise  becomes  more  severe. 
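
The beacon rule can be simulated in the same way. In this sketch a sensor guesses that its gain equals the beacon's gain exactly when the beacon bit agrees with the sign of its own observation, and the guess counts as a success when it matches the true relation between the two gains. The comparison value assumes that a success occurs when both signs are read correctly or both are read incorrectly, with the two noises independent; the function names and parameter values are assumptions for the illustration.

```python
import math
import random

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def sign(x):
    return 1 if x >= 0 else -1

def beacon_success_rate(s0, sd_w, trials, rng):
    """Fraction of trials in which sensor m correctly decides whether A_m = A_b."""
    hits = 0
    for _ in range(trials):
        a_b = rng.choice((-1, 1))
        a_m = rng.choice((-1, 1))
        g_b = sign(a_b * s0 + rng.gauss(0.0, sd_w))   # broadcast beacon bit
        u_m = a_m * s0 + rng.gauss(0.0, sd_w)         # local observation
        guess_same = (g_b == sign(u_m))               # rule: same sign => A_m = A_b
        hits += (guess_same == (a_m == a_b))
    return hits / trials

rng = random.Random(5)
s0, sd_w = 1.0, 1.0
empirical = beacon_success_rate(s0, sd_w, 20000, rng)
x = abs(s0) / sd_w
theory = 1.0 - 2.0 * q_function(x) * q_function(-x)
print(empirical, theory)
```

The noisy beacon lowers the success probability relative to perfect sign feedback at the same source sample, in line with the remark that the constants become worse as the noise becomes more severe.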


4.3.3  Many  bits  of  feedback 

We now consider the effect of lengthening the discovery phase by allowing the feedback to involve
more than one sample of the source. Let $S_0 = S[0]$ and $S_{-1} = S[-1]$ be the two source samples
on which we base our feedback $f(S_0, S_{-1})$. Conditioned on a realization $(s_0, s_{-1})$ of these two
samples, the sensor observations are again independent, so each sensor should seek to maximize its
own probability of success. To illustrate the effect of adding more feedback, we consider the perfect
feedback of Section 4.3.2 to simplify the expressions. A similar analysis can be carried out for the
noisy feedback case.

Figure 4.3. MAP rules for perfect feedback and perfect sign feedback. The plus is the noisy
observation $(u_m[0], u_m[-1])$. Under perfect information, $\hat{A}_m = 1$, whereas with only sign information
the expected probability of success is maximized when $\hat{A}_m = -1$.


Suppose our feedback is the pair $(S_0, S_{-1})$. Conditioned on the values of $S_0$ and $S_{-1}$, sensor
m's observations $(U_m[0], U_m[-1])$ are jointly Gaussian with mean $(s_0, s_{-1})$ under the hypothesis
$A_m = 1$ and mean $(-s_0, -s_{-1})$ under the hypothesis $A_m = -1$. The problem is again the same
as that of detecting antipodal signals in the presence of Gaussian noise, and the MAP rule is a
threshold test shown in Figure 4.3. The probability of success for this rule is

$$v_p(s_0, s_{-1}) = 1 - Q\left(\frac{\sqrt{s_0^2 + s_{-1}^2}}{\sigma_W}\right) . \qquad (4.54)$$

Let H be the event $(|S_0| < \epsilon\alpha(M), |S_{-1}| < \epsilon\alpha(M))$, and $H^c$ its complement. Then we have:

$$v_p(S_0, S_{-1} \mid H^c) \ge 1 - Q\left(\frac{\epsilon\alpha(M)}{\sigma_W}\right) \qquad (4.55)$$

$$\ge \frac{1}{2} + \epsilon\alpha(M)\,\frac{1}{2\sqrt{2\pi}\,\sigma_W} . \qquad (4.56)$$

Since $S_0$ and $S_{-1}$ are independent, P(H) is proportional to $\alpha(M)^2$. The analysis in Proposition
10 implies that $D(M) = O(\max\{\alpha(M)^2, M^{-1}\alpha(M)^{-2}\})$. Setting these two terms equal gives $\alpha(M) = M^{-1/4}$,
so $D(M) = O(M^{-1/2})$.

Extending the above analysis in a standard way shows that with K samples of feedback we get
distortion $D(M) = O(M^{-K/(K+2)})$. By choosing K arbitrarily large, we get closer to the optimal
rate of $O(M^{-1})$.
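
The exponent K/(K+2) follows from a one-line balancing argument, sketched below with exact rational arithmetic. The sketch assumes that with K feedback samples the probability of the bad event scales like $\alpha(M)^K$ and the alignment bias scales like $\alpha(M)$, so the two competing terms are $\alpha(M)^K$ and $M^{-1}\alpha(M)^{-2}$; the function name is hypothetical.

```python
from fractions import Fraction

def feedback_scaling_exponents(K):
    """Balance alpha^K against 1/(M*alpha^2) with alpha = M^(-a).

    Returns (a, d) where alpha(M) = M^(-a) and the distortion is O(M^(-d)).
    """
    a = Fraction(1, K + 2)   # equal exponents: K*a = 1 - 2*a, so a = 1/(K+2)
    return a, K * a

for K in (1, 2, 4, 16):
    a, d = feedback_scaling_exponents(K)
    print(K, a, d)
```

For K = 1 and K = 2 this gives the exponents 1/3 and 1/2 derived above, and the exponent approaches 1 as K grows without ever reaching it.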

Suppose we only get the signs of $S_0$ and $S_{-1}$, so that $f(S_0, S_{-1}) = (\mathrm{sgn}(S_0), \mathrm{sgn}(S_{-1}))$. The
threshold test in the MAP rule for perfect feedback depends on the actual values of $(s_0, s_{-1})$, as
opposed to the threshold test when K = 1. Thus the scaling result above does not immediately
follow. Sensor m must then determine, based on the observation pair $(U_m[0], U_m[-1])$, whether
$A_m = 1$ or $A_m = -1$ in a way that maximizes its probability of making a correct decision. We
would like for the probability of success to be greater than 1/2.

Under the two hypotheses, the pair $(U_m[0], U_m[-1])$ is jointly Gaussian with mean $(s_0, s_{-1})$
for $A_m = 1$, mean $(-s_0, -s_{-1})$ for $A_m = -1$, and covariance $\sigma_W^2 I$. The likelihood of observing
$(U_m[0], U_m[-1])$ is the expectation of the conditional likelihood over all source pairs $(s_0, s_{-1})$ that
could have generated $f(s_0, s_{-1})$. Because of the symmetry in the distribution of $(S_0, S_{-1})$, the
likelihoods under the two hypotheses are equal on the line orthogonal to the vector $f(s_0, s_{-1})$, so
the MAP estimate is a threshold on that line.

To illustrate the difference between the MAP estimate for the case of perfect feedback versus
the case of sign feedback, consider Figure 4.3. For a fixed $(s_0, s_{-1})$, the probability of error is given
by the probability that the noise exceeds the distance from the point $(s_0, s_{-1})$ to the threshold in
the direction orthogonal to the decision boundary:

$$v_p(s_0, s_{-1}) = 1 - Q\left(\frac{|s_0| + |s_{-1}|}{\sqrt{2}\,\sigma_W}\right) . \qquad (4.57)$$

This differs from equation (4.54) by a change in the norm inside the $Q(\cdot)$ function; with perfect
knowledge of the samples we have an $\ell_2$ norm, and with only the sign we have an $\ell_1$ norm. This explains
why the tests and errors were the same in the case when K = 1.

Let H be the event that $|S_0|, |S_{-1}| < \epsilon\alpha(M)$ and $H^c$ its complement. Then we can bound the
probability of success:

$$v_p(s_0, s_{-1} \mid H^c) \ge \frac{1}{2} + \frac{\epsilon}{4\sqrt{\pi}\,\sigma_W}\,\alpha(M) . \qquad (4.58)$$

Thus from our previous analysis, $D(M) = O(M^{-1/2})$ as in the case when we have perfect knowledge
of the source samples.
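
The gap between perfect feedback and sign-only feedback with two samples can be made concrete by comparing the two success probabilities. The sketch assumes the sign-feedback decision boundary is the line through the origin orthogonal to $(\mathrm{sgn}(s_0), \mathrm{sgn}(s_{-1}))$, so the relevant distance is $(|s_0| + |s_{-1}|)/\sqrt{2}$, while perfect feedback uses the Euclidean norm; the function names are hypothetical.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def success_perfect(s0, s1, sd_w):
    """Success probability when the sensor knows both samples (l2 norm)."""
    return 1.0 - q_function(math.hypot(s0, s1) / sd_w)

def success_sign_only(s0, s1, sd_w):
    """Success probability when the sensor knows only the two signs (scaled l1 norm)."""
    return 1.0 - q_function((abs(s0) + abs(s1)) / (math.sqrt(2.0) * sd_w))

for s0, s1 in ((1.0, 1.0), (2.0, 0.1), (0.3, -1.5)):
    print((s0, s1), success_perfect(s0, s1, 1.0), success_sign_only(s0, s1, 1.0))
```

By the Cauchy-Schwarz inequality the sign-only probability never exceeds the perfect-feedback one, with equality when the two samples have equal magnitude, so only constants change and the scaling in M is preserved.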

4.3.4  Feedback  for  bounded  scalar  fading 

As an extension to these results on sign fading, we can modify the scheme for bounded real
scalar fading. In order to prevent excess noise, we can set a threshold of $\epsilon$. All sensors which
estimate their gain $|A_m| < \epsilon$ do not transmit anything. Since the gains are iid, this affects a
constant fraction $P(|A_m| < \epsilon)$ of the sensors almost surely as $M \to \infty$. Consider the case of 1-bit
feedback. Suppose the beacon broadcasts the sign of its observation $G_b$ at time 1 to all the other
sensors:

$$G_b = \mathrm{sgn}(U_b[1]) = \mathrm{sgn}(A_b S[1] + W_b[1]) . \qquad (4.59)$$

We call this signal the feedback function $f(U_b)$.

Sensor m then checks if $G_b = G_m$ (defined analogously). If so, it estimates $\mathrm{sgn}(A_m) = \mathrm{sgn}(A_b)$,
otherwise it estimates $\mathrm{sgn}(A_m) = -\mathrm{sgn}(A_b)$. Call this decision rule $g(G_b, U_m)$. This rule is the
maximum a posteriori probability (MAP) rule for detecting the sign of $A_m$. Denote by $v_m(S[1])$
the probability of successfully estimating $\mathrm{sgn}(A_m)$ at sensor m using this rule. Conditioned on
S[1], all the sensor observations are independent. We have already assumed that $|A_m| > \epsilon$, so
$v_m(S[1]) > 1/2 + \delta$ for all m. We have the same proposition relating the scaling rate to the
success probability:

Proposition 12. Suppose the feedback function $f(\cdot)$ and decision rule $g(\cdot)$ are chosen such that
there exist constants $\epsilon_1 > 0$, $\epsilon_2 > 0$, and functions $\alpha(M)$ and $\beta(M)$ such that

$$\lim_{M\to\infty} \alpha(M) = 0 \qquad (4.60)$$

$$\lim_{M\to\infty} \beta(M) = 0 \qquad (4.61)$$

$$\lim_{M\to\infty} M\beta(M)^2 = \infty \qquad (4.62)$$

$$(|S[1]| > \epsilon_1\alpha(M)) \implies v(S[1]) \ge \frac{1}{2} + \epsilon_2\beta(M) . \qquad (4.63)$$

Then as $M \to \infty$, the feedback/decision pair (f, g) achieves a distortion D(M) in the transmission
phase satisfying

$$D(M) = O\left(\max\left\{\alpha(M), \frac{1}{M\beta(M)^2}\right\}\right) \qquad (4.64)$$

for the Gaussian network with bounded real scalar fading.

We then have, by application of the previous proposition, that a single bit of feedback in the
mode we have described biases the sum of the forwarded observations into a regime that scales
faster than $\sqrt{M}$, which then avoids the central-limit behavior of the misaligned situation. It is
clear from these examples that the situations in which the fading process $\{A_m\}$ is problematic are
when $A_m$ is zero-mean and symmetric. In these cases the sign of $A_m$ plays a key role. This can be
thought of as a phase uncertainty introduced by the fading observations.

4.3.5  A  conjecture  for  the  CEO  problem  with  limited  feedback 

The  obvious  question  to  ask  at  this  point  is  how  separate  source  and  channel  coding  is  affected  by 
the  introduction  of  similar  feedback  capabilities.  Although  we  cannot  provide  a  definitive  answer 
at  this  time,  recent  results  on  the  CEO  problem  [29],  [31],  [38],  [39]  suggest  that  the  feedback 
signal  will  not  affect  the  scaling  behavior  of  the  distortion-rate  function.  We  will  briefly  sketch  an 
argument  along  these  lines,  following  the  very  recent  work  of  Wagner  [39]. 

We can bound the performance of a CEO code with a beacon by changing the model slightly.
Let us suppose there are $M + 1$ sensors numbered $0, 1, 2, \ldots, M$ and let $U_m$ be defined as before.
Now suppose $A_0 = 1$ and let the input to each sensor's encoder be the pair $(U_m, U_b)$ for
$m = 1, 2, \ldots, M$. We also give the signal $U_b$ to the decoder. Thus the entire problem has been reduced
to a "conditional CEO" problem with all terminals sharing a side information signal $U_b$, and
the observed variables at the encoders are conditionally independent given the pair $(S, U_b)$. By
evaluating the bound in [39] we can obtain a new set of achievable rates for this problem.

It is intuitively plausible that this coding problem is no different from the original CEO problem
save for an extra conditioning on $U_b$ that will only affect the distortion achievable at a given rate by
a constant. Furthermore it is clear that this problem will give a lower bound on the distortion-rate
function for the case with feedback because we do not even assume a rate-limitation on the side
information and even provide it to the decoder. We leave the proof of this conjecture for future
work.

4.4  Other  directions  and  our  example 

We  now  turn  to  Gaussian  networks  in  which  the  source  is  observed  through  a  class  of  linear 
filters.  Locally  to  the  sensors,  there  are  two  sources  of  misalignment  in  these  networks:  phase 
uncertainty,  and  delay  uncertainty.  The  former  refers  to  relative  phase  differences  between  sensors 
with  the  same  power  spectrum,  and  is  a  generalization  of  the  problem  in  the  previous  section.  The 
latter  refers  to  integer  delays  in  the  observation  of  the  source.  This  may  be  caused  by  propagation 
delays  or  a  lack  of  clock  synchronization. 

Suppose the observation ensemble $\mathcal{A}$ is a set of filters $\{A_i[n]\}$ such that the power spectrum of
the filtered source $A_i[n] * S[n]$ is nowhere zero. Each sensor receives the source filtered by one of
the filters in $\mathcal{A}$, where $A_m[n]$ is chosen uniformly from $\mathcal{A}$. The sensors can attempt to empirically
estimate their observation function's power spectral density and compensate for it by whitening
their observations. For example, if $\mathcal{A} = \{\pm 1 \pm \tfrac{1}{2} z^{-1}\}$ then the power spectral density for sensor
$m$ could be

$$U_m(e^{j\omega}) = \sigma_W^2 + \left(1 + \frac{1}{4}\right)\sigma_S^2 + \sigma_S^2 \cos\omega \tag{4.65}$$

for $A_m(z) = \pm(1 + \tfrac{1}{2} z^{-1})$ or

$$U_m(e^{j\omega}) = \sigma_W^2 + \left(1 + \frac{1}{4}\right)\sigma_S^2 - \sigma_S^2 \cos\omega \tag{4.66}$$

for $A_m(z) = \pm(1 - \tfrac{1}{2} z^{-1})$.

In this example, some of the filters have the same power spectrum but different phases. The
ambiguity in phase, which in this case is again a sign change, cannot be distinguished by the sensors
locally. If feedback were provided in the same way as in the previous section, we might be able to
align the sensors to enable coherent communication. In this case, one sensor from each of the two
spectral classes would broadcast the sign of its observation; by comparing signs we can achieve
the same alignment as before.


More generally, consider an arbitrary collection of causal finite impulse response (FIR) filters $\mathcal{A}$
with identical power spectra, and which all have nonzero impulse response at 0. The sensors cannot
distinguish between these filters locally. Suppose the set of indistinguishable filters $\mathcal{A}$ depends on
$L + 1$ source samples:

$$A(e^{j\omega}) = \sum_{k=0}^{L} a_k e^{-j\omega k} . \tag{4.67}$$

Then the power spectrum of sensor $m$'s observed process is

$$U_m(e^{j\omega}) = \sigma_S^2 \left( \sum_{i=0}^{L} a_i^2 + 2 \sum_{k=1}^{L} \left( \sum_{i=k}^{L} a_i a_{i-k} \right) \cos k\omega \right) + \sigma_W^2 . \tag{4.68}$$

However, the only way for two filters $A_1[n]$ and $A_2[n]$ to induce the same power spectrum $U(e^{j\omega})$
is for the coefficients of $\cos k\omega$ to all be equal. This in turn implies that $a_{1k} = -a_{2k}$ for all $k$, so
the ambiguity is at most between a filter and its negative.
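
As a numerical illustration of the power-spectrum expression, the sketch below evaluates the observed spectrum in two ways: directly as $\sigma_S^2 |A(e^{j\omega})|^2 + \sigma_W^2$, and as a cosine series in the filter's autocorrelation (the standard expansion of $|A|^2$ for real taps). It then checks that a filter and its negative give the same spectrum. The function names and the unit variances are illustrative assumptions, and the example taps $(1, 1/2)$ are one reading of the two-tap example given earlier.

```python
import numpy as np

def psd_direct(a, omega, sigma_s2=1.0, sigma_w2=1.0):
    """sigma_S^2 * |A(e^{j omega})|^2 + sigma_W^2, with A = sum_k a_k e^{-j omega k}."""
    a = np.asarray(a, dtype=float)
    k = np.arange(len(a))
    A = np.exp(-1j * np.outer(omega, k)) @ a
    return sigma_s2 * np.abs(A) ** 2 + sigma_w2

def psd_cosine(a, omega, sigma_s2=1.0, sigma_w2=1.0):
    """The same spectrum written as a cosine series in the tap autocorrelation."""
    a = np.asarray(a, dtype=float)
    L = len(a) - 1
    # r[k] = sum_i a[i] * a[i - k] for lags k = 0..L (zero lag sits at index L).
    r = np.correlate(a, a, mode="full")[L:]
    k = np.arange(1, L + 1)
    series = r[0] + 2.0 * (np.cos(np.outer(omega, k)) @ r[1:])
    return sigma_s2 * series + sigma_w2

if __name__ == "__main__":
    omega = np.linspace(-np.pi, np.pi, 9)
    a = np.array([1.0, 0.5])  # assumed example taps: A(z) = 1 + (1/2) z^{-1}
    print(np.allclose(psd_direct(a, omega), psd_cosine(a, omega)))
    # A sign flip leaves the power spectrum unchanged, so a sensor that only
    # estimates its spectrum cannot tell the filter from its negative.
    print(np.allclose(psd_direct(a, omega), psd_direct(-a, omega)))
```

Both comparisons agree for any real tap vector, since the spectrum depends on the taps only through products $a_i a_{i-k}$.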


The  results  of  the  previous  section  show  that  phase  misalignment  can  be  catastrophic  for  the 
uncoded  transmission  scheme,  but  a  small  amount  of  feedback  can  recover  some  of  the  performance 
and  outperform  the  best  separation-based  approach  if  the  ambiguity  takes  the  form  of  a  sign  shift. 
For  ensembles  of  real  filters,  we  have  shown  that  indeed  this  ambiguity  is  at  most  a  sign  shift.  To 
help  align  the  sensors  with  a  given  filter,  we  use  the  same  beacon  strategy  described  earlier.  This 
allows  a  sufficient  fraction  of  the  sensors  to  align  themselves  with  high  probability,  allowing  for 
further  processing  to  enable  coherent  communication. 


The second source of ambiguity that can arise with linear filters is in the absolute delay. Suppose
$\mathcal{A} = \{1, z^{-1}\}$, so that some sensors observe the source with unit delay. Suppose furthermore that
$P_A(1) = P_A(z^{-1}) = 1/2$. If the sensors forward their observations uncoded, the received signal
$Y[n]$ is given by:

$$Y[n] = \eta \left( B_0 S[n] + B_1 S[n-1] + \sum_{m=1}^{M} W_m[n] \right) + Z[n] , \tag{4.69}$$

where $B_0$ is the number of sensors with zero delay, $B_1$ is the number of sensors with unit delay, and
$\eta$ is

(4.70)

The destination can use a smoothing filter to estimate the source sequence. Using a Wiener
filter, the power spectrum of the error is given by

$$e(e^{j\omega}) = \frac{(M \eta^2 \sigma_W^2 + \sigma_Z^2)\,\sigma_S^2}{(B_0^2 + B_1^2 + 2 B_0 B_1 \cos\omega)\,\eta^2 \sigma_S^2 + M \eta^2 \sigma_W^2 + \sigma_Z^2} . \tag{4.71}$$

This error function is different from that in the sign-mismatch case, but also suffers from the effects
of phase mismatch from the $\cos\omega$ term. However, a feedback scheme which can bias the sensors'
estimates of their own delay can shift the scaling rate in the denominator of the integral.

Returning to our canonical example, we can note that it is quite similar to the delay example
above. If there are two channel uses per source sample, the sensors can use a feedback scheme to
estimate which source they are observing and slot their transmissions accordingly. Unfortunately, as
long as a fraction of sensors are misaligned, their contributions will result in a coherent interference
that scales as fast as the correctly aligned sensors. To be more precise, we would hope to get a
distortion of the form

$$D = E\left[ \frac{\sigma_S^2 \sigma_W^2}{\frac{B_0^2}{B_0 + (\sigma_Z^2/\sigma_W^2)\eta^{-2}} \sigma_S^2 + \sigma_W^2} + \frac{\sigma_S^2 \sigma_W^2}{\frac{B_1^2}{B_1 + (\sigma_Z^2/\sigma_W^2)\eta^{-2}} \sigma_S^2 + \sigma_W^2} \right] \tag{4.72}$$

$$= 2 \, \frac{\sigma_S^2 \sigma_W^2}{\frac{(M/2)^2}{(M/2) + (\sigma_Z^2/\sigma_W^2)\eta^{-2}} \sigma_S^2 + \sigma_W^2} . \tag{4.73}$$

With the feedback, a portion $\beta$ of the sensors will remain misaligned (in either direction), so we
would have in expectation:


2  _2 


D  =  2 


( TqCT 


suw 


m/2-p)2*2 

{M/2)a2w+M2024+^sV-2°S 


(4.74) 


One way around this is to allow the amount of feedback to increase with $M$. We leave this
for future work.


These examples suggest that phase uncertainty can render uncoded schemes useless, but a little
bit of feedback goes a long way in terms of scaling rate. For complex signals with continuous
phase differences, the efficacy of this one-shot feedback scheme may be more limited. However,
modifying the scheme so that the beacon opportunistically waits for a good signal may allow a
tradeoff between the length of the discovery period and the achievable distortion scaling.




Chapter  5 


Coda 


The  current  surge  in  research  attention  on  sensor  networks  is  fueled  by  the  development  of 
practical  platforms  for  implementing  a  wide  range  of  applications.  These  applications  range  from 
military  surveillance,  environmental  and  industrial  monitoring,  and  traffic  regulation  to  “smart 
homes,”  commercial  robotics,  and  biomedical  sensing.  In  many  cases  the  network  is  used  to  gather 
data  in  order  to  estimate  some  underlying  variable.  Because  of  the  inherent  unreliability  in  sensor 
placement and physical modeling, robust estimation systems are needed in order to deal with
real-world applications. In this thesis we took a simple sensor network model and introduced structured
uncertainty  in  the  observation  model.  This  structured  uncertainty  took  the  form  of  an  unknown 
correlation  between  the  observed  variables.  Our  objectives  were  to  examine  the  performance  of 
existing  estimators  and  coding  schemes  and  to  propose  new  tradeoffs  and  protocols  to  improve  the 
performance  of  these  schemes  in  the  large-network  limit. 

The data-gathering sensor networks we study have three main behaviors: communication,
distributed processing, and estimation. We examined these in the reverse order. From the
perspective of centralized estimation, we looked at linear observation models with bounds on norms,
structural constraints, and random distributions and used the same asymptotic spectral expansions
to find the performance of estimators for each type. On the distributed coding front, we looked at
universal Slepian-Wolf codes and showed that the order in which the limits are taken in looking
at large block-length codes is important. Finally, we looked at the effect of introducing a simple
communication channel into our sensor network. Although phase uncertainty rendered the uncoded
transmission protocol useless, we exhibited a simple one-time feedback protocol that can bias the
network to recapture some of the performance loss.

The expectation among engineers is that an optimal algorithm should also be robust to
perturbations in the modeling assumptions. In sensor networks these perturbations may not be small
deviations but instead larger structural uncertainties. Therefore robust models and protocols for
sensor networks should incorporate some of this uncertainty or provide a mechanism by which it
can be resolved. In this thesis we attempted to provide some steps towards analyzing the effect
of this uncertainty for Gaussian networks. Hopefully some of the problems and ideas studied here will
provide some insight into real network designs.